Supplementary note

“Supplementary note” published in The R Journal.

Bastien Chassagnol (LPSM), Antoine Bichat (Les Laboratoires Servier), Gregory Nuel (LPSM, UMR CNRS 8001), Etienne Becht (Les Laboratoires Servier)
2023-04-28

For ease of reading, we show again in Table 1 the general configuration used to run all the benchmarks tested.

Table 1: Global options shared by all the benchmarked packages.

| Initialisation methods | Algorithms | Criterion threshold | Maximal iterations | Number of observations |
|---|---|---|---|---|
| hc, kmeans, small EM, rebmix, quantiles, random | EM R, Rmixmod, bgmm, mclust, flexmix, EMCluster, mixtools, GMKMCharlie | \(10^{-6}\) | 1,000 | 100, 200, 500, 1000, 2000, 5000, 10000 |

1 Appendix A: In-depth statistical elements about parameter estimation in GMMs

Application of the EM algorithm to GMMs

While solving Equation (10) to retrieve the MLE estimates in the M-step of the EM algorithm, we must enforce the non-negativity and sum-to-one constraints of the mixture weights (Equation (2)). This is achieved with a Lagrange multiplier, which in practice consists of adding the equality constraint over the parameters to estimate, here \(-\lambda (\sum_{j=1}^k p_j -1)\), to the function to be optimised (Walsh 1975).

Evaluating the roots of the derivative of the auxiliary function (see Equation (10)) with respect to the parameter \(p_j\), under the additional unit-simplex constraint (2), readily yields an MLE estimate of the mixture proportions, valid for any finite mixture model (Equation (1)):

\[\begin{equation} \hat{p_j}= \frac{\sum_{i=1}^n \eta_{i}(j)}{n} \tag{1} \end{equation}\]

Additionally, in both the univariate and multivariate settings we restricted ourselves to the fully unconstrained parametrisation, in which each component follows its own parametric distribution. The general derivative of the auxiliary function with respect to each component's parametric distribution \(\zeta_j\) is given by Equation (2):

\[\begin{equation} \frac{\partial Q(\theta|\hat{\theta}_{q-1})}{\partial \zeta_j}=\sum_{i=1}^n \eta_{i}(j) \frac{\partial \log (f_{\zeta_j}(X_i|S_i=j))}{\partial \zeta_j} \tag{2} \end{equation}\]

Accordingly, if a closed form for the MLE in the supervised case is known (and fortunately this is the case for both the univariate and multivariate Gaussian distributions), the maximum of the auxiliary function can be readily calculated.

Plugging the corresponding parametric distribution into the auxiliary function (10) yields the following formula for the univariate GMM (Equation (3)):

\[\begin{equation} Q(\theta| \hat{\theta}_{q-1}) = \sum_{i=1}^n \sum_{j=1}^k \eta_i(j) \left( \log (p_j) - \log (\sigma_j) - \frac{(X_i-\mu_j)^2}{2\sigma_j^2} \right) + K \tag{3} \end{equation}\]

and Equation (4) for the multivariate GMM:

\[\begin{equation} Q(\theta| \hat{\theta}_{q-1}) = \sum_{i=1}^n \sum_{j=1}^k \eta_i(j) \left[ \log (p_j) - \frac{1}{2} \left( \log(\operatorname{det}(\boldsymbol{\Sigma}_j)) + (x_i - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}_j^{-1}(x_i - \boldsymbol{\mu}_j)\right) \right] + K \tag{4} \end{equation}\]

\(K\) is a constant, with respective values of \(-\frac{n\log(2\pi)}{2}\) and \(-\frac{nD\log(2\pi)}{2}\) in the univariate and multivariate settings.

In the univariate setting, the MLE estimates of each component's mean, \(\mu_j\), and variance, \(\sigma_j^2\), are readily available (Equations (5) - (6)):

\[\begin{equation} \frac{\partial Q(\theta|\hat{\theta}_{q-1})}{\partial \mu_j} = 0 \Leftrightarrow \mu_j = \frac{\sum_{i=1}^n \eta_i(j) X_i}{\sum_{i=1}^n \eta_i(j)} \tag{5} \end{equation}\] \[\begin{equation} \frac{\partial Q(\theta|\hat{\theta}_{q-1})}{\partial \sigma_j} = 0 \Leftrightarrow \sigma_j^2 = \frac{\sum_{i=1}^n \eta_i(j) (x_i - \mu_j)^2 }{\sum_{i=1}^n \eta_i(j)} \tag{6} \end{equation}\]
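As a concrete illustration, the univariate M-step updates of Equations (1), (5) and (6) can be sketched in a few lines of R; the function name `m_step_univariate` is our own, not taken from any reviewed package:

```r
# One M-step of the EM algorithm for a univariate GMM (Equations (1), (5), (6)).
# x: numeric vector of n observations; eta: n x k matrix of responsibilities.
m_step_univariate <- function(x, eta) {
  n_j <- colSums(eta)               # effective sample size of each component
  p   <- n_j / length(x)            # Equation (1): mixture proportions
  mu  <- colSums(eta * x) / n_j     # Equation (5): weighted means
  sigma2 <- sapply(seq_along(n_j), function(j) {
    sum(eta[, j] * (x - mu[j])^2) / n_j[j]   # Equation (6): weighted variances
  })
  list(p = p, mu = mu, sigma2 = sigma2)
}

# With hard (0/1) responsibilities, the updates reduce to per-group moments:
x   <- c(0, 2, 10, 14)
eta <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))
m_step_univariate(x, eta)   # p = (0.5, 0.5), mu = (1, 12), sigma2 = (1, 4)
```

In a full EM loop, this M-step would alternate with the E-step of Table 2 until the stopping criterion of Table 1 is met.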

Before finding the optimum of the auxiliary function in the multivariate setting, we remind the interested reader of some relevant calculus formulas below:

  1. \(\operatorname{det}(p\boldsymbol{A})=p^D \operatorname{det} (\boldsymbol{A})\)
  1. \(\operatorname{det}(\boldsymbol{A}^{-1})=\frac{1}{\operatorname{det}(\boldsymbol{A})}\)
  1. \(\left(\boldsymbol{A}^{-1} \right)^\top=\boldsymbol{A}^{-1}\) when \(\boldsymbol{A}\) is symmetric

Given a symmetric matrix \(\boldsymbol{A}\) of full rank \(D\) and two vectors \(\boldsymbol{x}\) and \(\boldsymbol{\mu}\) of size \(D\), the following derivative properties hold:

  1. \(\frac{\partial \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}}{\partial \boldsymbol{A}} = \boldsymbol{x} \boldsymbol{x}^\top\)
  1. \(\frac{\partial (\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{A} (\boldsymbol{x}-\boldsymbol{\mu})}{\partial \boldsymbol{\mu}} = -2 \boldsymbol{A}(\boldsymbol{x}-\boldsymbol{\mu})\)
  1. \(\frac{\partial \log\left(\operatorname{det}(\boldsymbol{A})\right)}{\partial \boldsymbol{A}^{-1}} = -\boldsymbol{A}\)

Using the calculus formulas listed above, closed forms for the MLE estimates of the mean, \(\boldsymbol{\mu}_j\), and covariance, \(\boldsymbol{\Sigma}_j\), are readily computed (see Equations (7) - (8)):

\[\begin{equation} \frac{\partial Q(\theta|\hat{\theta}_{q-1})}{\partial \boldsymbol{\mu}_j} = \sum_{i=1}^n \eta_i(j) \boldsymbol{\Sigma}_j^{-1}(\boldsymbol{x}_i - \boldsymbol{\mu}_j) =0 \Leftrightarrow \boldsymbol{\mu}_j = \frac{\sum_{i=1}^n \eta_i(j) \boldsymbol{x}_i}{\sum_{i=1}^n \eta_i(j)} \tag{7} \end{equation}\] \[\begin{equation} \frac{\partial Q(\theta|\hat{\theta}_{q-1})}{\partial \boldsymbol{\Sigma}_j^{-1}}=\frac{1}{2} \sum_{i=1}^n \eta_i(j) \left[ \boldsymbol{\Sigma}_j - (\boldsymbol{x}_i - \boldsymbol{\mu}_j)(\boldsymbol{x}_i - \boldsymbol{\mu}_j)^\top \right]= 0 \Leftrightarrow \boldsymbol{\Sigma}_j = \frac{\sum_{i=1}^n \eta_i(j) (\boldsymbol{x}_i - \boldsymbol{\mu}_j)(\boldsymbol{x}_i - \boldsymbol{\mu}_j)^\top }{\sum_{i=1}^n \eta_i(j)} \tag{8} \end{equation}\]

Explicitly optimising Equations (3) - (4) yields the following MLE parameters in both the univariate and multivariate settings (Table 2), as detailed in Leytham (1984) and Redner and Walker (1984):

Table 2: An overview of the practical implementation of the EM algorithm in GMMs.

| Step | Univariate GMM | Multivariate GMM |
|---|---|---|
| E-step | \(\eta_i(j) = \frac{\hat{p}_j^q \mathcal{N} (x_i \mid \hat{\mu}_j^q, \hat{\sigma}_j^q)}{\sum_{j'=1}^k \hat{p}_{j'}^q \mathcal{N} (x_i \mid \hat{\mu}_{j'}^q, \hat{\sigma}_{j'}^q)}\) | \(\eta_i(j) = \frac{\hat{p}_j^q \mathcal{N}_D (\boldsymbol{x}_i \mid \hat{\boldsymbol{\mu}}_j^q, \hat{\boldsymbol{\Sigma}}_j^q)}{\sum_{j'=1}^k \hat{p}_{j'}^q \mathcal{N}_D (\boldsymbol{x}_i \mid \hat{\boldsymbol{\mu}}_{j'}^q, \hat{\boldsymbol{\Sigma}}_{j'}^q)}\) |
| M-step: proportions | \(\hat{p}_j^{q+1}= \frac{\sum_{i=1}^n \eta_{i}(j)}{n}\) | \(\hat{p}_j^{q+1}= \frac{\sum_{i=1}^n \eta_{i}(j)}{n}\) |
| M-step: means | \(\hat{\mu}_j^{q+1} = \frac{\sum_{i=1}^n \eta_i(j) x_i}{\sum_{i=1}^n \eta_i(j)}\) | \(\hat{\boldsymbol{\mu}}_j^{q+1} = \frac{\sum_{i=1}^n \eta_i(j) \boldsymbol{x}_i}{\sum_{i=1}^n \eta_i(j)}\) |
| M-step: variances | \(\left(\hat{\sigma}_j^2\right)^{q+1} = \frac{\sum_{i=1}^n \eta_i(j) \left(x_i - \hat{\mu}_j^{q+1}\right)^2 }{\sum_{i=1}^n \eta_i(j)}\) | \(\hat{\boldsymbol{\Sigma}}_j^{q+1} = \frac{\sum_{i=1}^n \eta_i(j) \left(\boldsymbol{x}_i - \hat{\boldsymbol{\mu}}_j^{q+1} \right)\left(\boldsymbol{x}_i - \hat{\boldsymbol{\mu}}_j^{q+1} \right)^\top }{\sum_{i=1}^n \eta_i(j)}\) |

In both cases, obtaining the parameters of each component's parametric distribution turns out to be equivalent to computing the mean and variance of a weighted sample, which can be done in R with the stats::weighted.mean and stats::cov.wt functions. Importantly, the value of the mapping function depends only on the set of observations \(X\), and not on the parameter \(\theta\) to estimate. Indeed, the statistic computed by the EM algorithm is sufficient, which is one of its main advantages.
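A minimal sketch of the multivariate M-step for a single component \(j\) through these two base-R functions; the data and the responsibilities below are purely illustrative, not produced by an actual E-step:

```r
set.seed(1)
x     <- matrix(rnorm(200), ncol = 2)   # n = 100 bivariate observations
eta_j <- runif(100)                     # illustrative responsibilities for component j

# Equation (7): weighted mean of component j, one coordinate at a time
mu_j <- apply(x, 2, stats::weighted.mean, w = eta_j)

# Equation (8): weighted covariance; method = "ML" divides by the total weight,
# matching the EM update (no n - 1 correction)
fit     <- stats::cov.wt(x, wt = eta_j, method = "ML")
Sigma_j <- fit$cov

stopifnot(all.equal(fit$center, mu_j))  # both routes return the same weighted mean
```

stats::cov.wt normalises the weights internally, so the raw responsibilities can be passed directly as `wt`.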

The complete code associated with our R implementation is provided by the functions enmix_univariate and enmix_bivariate for the univariate and multivariate settings respectively, available on GitHub at RGMMBench, together with the programs used to generate the plots and tables of the article. We additionally made two choices not clearly settled in the literature. First, we compute the log-likelihood in the log domain, rewriting the mixture density as a sum of exponentials (Equation (9)):

\[\begin{equation} \begin{split} \ell (\theta|x) &= \log \left(\sum_{j=1}^k p_j f_{\zeta_j} (x)\right) \\ & = \log \left( \sum_{j=1}^k \exp \left[ \log(p_j) + \log(f_{\zeta_j} (x)) \right]\right) \quad \end{split} \tag{9} \end{equation}\]

Second, we rewrite the sum of exponentials enclosed in the logarithm so as to use an accurate evaluation of \(\log(1+x)\) for \(|x|\ll 1\), see Equation (10):

\[\begin{equation} \begin{split} \log \left( \sum_{j=1}^k e^{a_j}\right)&=\log \left( \exp(a_{j'}) \times \left[ 1+ \sum_{j\neq j'}\exp\left(a_j - a_{j'}\right) \right] \right) \\ & = a_{j'} + \operatorname{log1p} \left(\sum_{j\neq j'}\exp\left(a_j - a_{j'}\right)\right), \, \text{ with } j'=\arg \max_{j \in \{1, \ldots, k\}} (a_j) \end{split} \tag{10} \end{equation}\]

with base::log1p() the R function dedicated to the accurate computation of \(\log(1+x)\) for small \(x\). The posterior probabilities are then given by Equation (11):

\[\begin{equation} \log\left(\mathbb{P}_\theta(S=j | X=x)\right)=\log(p_j) + \log(f_{\zeta_j}(x)) - \ell (\theta |x) \tag{11} \end{equation}\]
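The two choices above can be sketched in R; `log_sum_exp` is our own helper built on base::log1p(), not a function from a reviewed package:

```r
# Numerically stable log-sum-exp (Equation (10)):
# log(sum(exp(a))) = a[j'] + log1p(sum(exp(a[-j'] - a[j']))), j' = argmax a
log_sum_exp <- function(a) {
  j <- which.max(a)
  a[j] + log1p(sum(exp(a[-j] - a[j])))
}

# Log-density of a two-component univariate GMM at x = 1, in the log domain
p <- c(0.4, 0.6); mu <- c(0, 3); sigma <- c(1, 2)
a  <- log(p) + dnorm(1, mu, sigma, log = TRUE)  # log(p_j) + log f_j(x), Equation (9)
ll <- log_sum_exp(a)                            # log-likelihood of the observation

# Equation (11): log posterior probabilities sum to one on the natural scale
log_post <- a - ll
stopifnot(all.equal(sum(exp(log_post)), 1),
          all.equal(ll, log(sum(p * dnorm(1, mu, sigma)))))
```

Unlike the naive `log(sum(exp(a)))`, this form does not underflow when all the \(a_j\) are strongly negative, which happens for observations far in the tails of every component.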

Parsimonious parametrisation of multivariate GMMs

Parsimonious parametrisations of multivariate GMMs are obtained through the following eigenvalue factorisation of the covariance matrix (Equation (12)):

\[\begin{equation} \boldsymbol{\Sigma}_j=\lambda_j \boldsymbol{Q}_j \boldsymbol{D}_j \boldsymbol{Q}_j^\top \tag{12} \end{equation}\]

with \(\lambda_j=\operatorname{det}\left(\boldsymbol{\Sigma}_j\right)^{\frac{1}{D}}\) a scalar proportional to the total volume of the ellipsoid (or its area in the bidimensional setting), \(\boldsymbol{D}_j\) a diagonal matrix storing the eigenvalues, normalised such that \(\operatorname{det}(\boldsymbol{D}_j)=1\), and \(\boldsymbol{Q}_j\) an orthogonal matrix of \(\mathcal{M}_D(\mathbb{R})\) whose columns are \(D\) linearly independent eigenvectors forming an orthonormal basis of \(\mathbb{R}^D\), \(\boldsymbol{Q}_j^\top\) being its transpose. The existence of the decomposition is guaranteed by the positive-definiteness constraint on the covariance matrix, while the orthogonality of \(\boldsymbol{Q}_j\) results from the symmetry of \(\boldsymbol{\Sigma}_j\). When the matrix to factorise is symmetric positive-definite, this factorisation is also referred to as the spectral decomposition, a special case of the eigendecomposition.
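The factorisation of Equation (12) can be recovered numerically with base R's eigen(); the covariance matrix below is an arbitrary example of ours:

```r
# Recovering (lambda_j, D_j, Q_j) of Equation (12) from a covariance matrix.
Sigma <- matrix(c(4, 1.5, 1.5, 2), nrow = 2)  # an arbitrary SPD covariance matrix
D_dim <- nrow(Sigma)

e      <- eigen(Sigma, symmetric = TRUE)
lambda <- det(Sigma)^(1 / D_dim)       # volume term
Q      <- e$vectors                    # orthogonal matrix of eigenvectors
D_mat  <- diag(e$values / lambda)      # normalised shape matrix, det(D_mat) = 1

stopifnot(all.equal(lambda * Q %*% D_mat %*% t(Q), Sigma),   # Sigma = lambda Q D Q^T
          all.equal(det(D_mat), 1),                          # unit-determinant shape
          all.equal(crossprod(Q), diag(D_dim)))              # Q^T Q = I
```

The parsimonious models of Table 3 then correspond to constraining `lambda`, `D_mat` or `Q` to be shared across components.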

Each of these matrices can be constrained to be equal or variable across clusters; this decomposition hence yields 14 possible models with different geometric characteristics.

We detail the main characteristics of these 14 parametrisations (28 if we add the equiproportional hypothesis for each model) in Table 3:

Table 3: The 14 canonical parametrisations of the within-group covariance matrix \(\boldsymbol{\Sigma_j}\).

| Model | Notation | Family | M-step | Number of parameters |
|---|---|---|---|---|
| EII | \([\lambda I]\) | Spherical | CF | \(\alpha + 1\) |
| VII | \([\lambda_j I]\) | Spherical | CF | \(\alpha + k\) |
| EEI | \([\lambda D]\) | Diagonal | CF | \(\alpha + d\) |
| VEI | \([\lambda_j D]\) | Diagonal | IP | \(\alpha + d + k - 1\) |
| EVI | \([\lambda D_j]\) | Diagonal | CF | \(\alpha + kd - k + 1\) |
| VVI | \([\lambda_j D_j]\) | Diagonal | CF | \(\alpha + kd\) |
| EEE | \([\lambda Q D Q^\top]\) | Ellipsoidal | CF | \(\alpha + \beta\) |
| EVE | \([\lambda QD_j Q^\top]\) | Ellipsoidal | IP | \(\alpha + \beta\) |
| VEE | \([\lambda_j Q D Q^\top]\) | Ellipsoidal | IP | \(\alpha + \beta + (k-1)(d-1)\) |
| VVE | \([\lambda_j QD_j Q^\top]\) | Ellipsoidal | IP | \(\alpha + \beta + d(k-1)\) |
| EEV | \([\lambda Q_j D Q_j^\top]\) | Ellipsoidal | CF | \(\alpha + k \beta - d(k-1)\) |
| VEV | \([\lambda_j Q_j D Q_j^\top]\) | Ellipsoidal | IP | \(\alpha + k \beta - (k-1)(d-1)\) |
| EVV | \([\lambda Q_j D_j Q_j^\top]\) | Ellipsoidal | CF | \(\alpha + k \beta - k + 1\) |
| VVV | \([\lambda_j Q_j D_j Q_j^\top]\) | Ellipsoidal | CF | \(\alpha + k \beta\) |

Parameter estimation in a high-dimensional context

However, while parsimonious representations can greatly reduce the computational burden, none of those in the general family can handle degenerate cases where the number of features, \(D\), exceeds the number of observations, \(n\). Such situations, in which the number of features is large, are referred to as high-dimensional, raising the well-known issue of the "curse of dimensionality". Two distinct approaches have been developed in the literature to handle these degenerate cases:

When the sub-dimension \(d_j\) is known, a closed form is generally available for the M-step of the EM algorithm; however, \(d_j\) is itself a hyperparameter to estimate. Bouveyron et al. (2011) have nonetheless shown that a classical Cattell scree-test can be used to asymptotically estimate the intrinsic dimension of each cluster. Compared to the previous approach, this method has a strong theoretical background and a strong impact on running-time performance.

As a concrete use case taken from the help documentation of the HDclassif package, this approach made it possible to cluster a dataset of 10 classes, with only 130 observations overall, described in a 1024-dimensional space (the famous machine-learning digit-recognition problem). Variants of these approaches have been developed in the following packages: HDclassif, fabMix, EMMIXmfa and pgmm. We refer the interested reader to the educational vignette of HDclassif, https://rdrr.io/pkg/HDclassif/man/HDclassif-package.html, and to the papers (McNicholas and Murphy 2008; McNicholas et al. 2010; McNicholas and Murphy 2010).

Historically, the first mention of a probabilistic framework with an application to dimension reduction in the context of finite mixture models goes back to Tipping and Bishop (1999), based on principal component analysis. McLachlan et al. (2003) and McLachlan and Peel (2000) extended this original model by postulating that the distribution of the data within any latent class could be described using the tools of factor analysis. Finally, building on the parsimonious parametrisations already theorised for GMMs (see previous section), McNicholas and Murphy (2008), McNicholas et al. (2010) and Bouveyron et al. (2007) proposed a variety of constraints, this time defined directly on the projected subspace. Since all methods based on factor analysis provide a transition matrix, projecting the dataset onto the two or three most informative eigenvalues and their associated eigenvectors provides a simple visualisation tool for representing high-dimensional datasets. However, this method may not be suitable for unravelling the clustering structure. Instead, the GMMDR method, first proposed by Scrucca (2010) and implemented in the mclust::MclustDR() function of the mclust package, aims at recovering the subspace that best captures the underlying latent clustering structure (we notably expect invariance of the global overlap between the sampling space and the corresponding projected subspace). More precisely, the main objective of the GMMDR technique is to infer the global change-of-basis matrix \(\boldsymbol{Q}\) that minimises the differences in the a posteriori probabilities of assigning each observation \(i\) to a given cluster \(s_i\), knowing the value of the vector of observed covariates \(\boldsymbol{x}_i\). Namely, we look for the orientation matrix \(\boldsymbol{Q}\) that best satisfies the following objective (Equation (13)):

\[\begin{equation} \hat{\boldsymbol{Q}} = \arg \max_{\boldsymbol{Q}} \, \left(\mathbb{P}_{\theta} (S_i=j | \boldsymbol{X} =\boldsymbol{x}_i) = \mathbb{P}_{\theta} (S_i=j | \boldsymbol{X} \boldsymbol{Q}) \right) \text{ such that } S \perp \boldsymbol{X} | \boldsymbol{X} \boldsymbol{Q} \tag{13} \end{equation}\]

This procedure derives from the sliced inverse regression algorithm (Li 1991), but instead of conditioning on a known response variable, GMMDR conditions on the estimated MAP cluster assignments. Since the solution returned by this optimisation problem is not unique, the projection matrix is generally constrained to be orthonormal (the vectors forming the basis are pairwise orthogonal and individually of norm 1).

Model selection

When comparing several models with different numbers of components or parametrisations, the likelihood is uninformative, as it can be made arbitrarily large by increasing the complexity of the model or adding components. It is then necessary to penalise for complexity when comparing them. The general form of the penalised metric, GIC (for generalised information criteria), is given by Equation (14):

\[\begin{equation} \operatorname{GIC} (\theta) = \underbrace{p(\theta)}_{\text{penalty term}} - \underbrace{2 \ell (X | \theta)}_{\text{log-likelihood of the model}} \tag{14} \end{equation}\]

Among them, we set apart scores focused on selecting the right number of parameters and components, namely the degrees of freedom (d.o.f.) of the model (\(3k-1\) parameters for the univariate unconstrained GMM), and those focused on retrieving readable clusters.

In the first category, the AIC (Akaike information criterion) is a minimax-rate optimal score (it minimises the risk in the worst case) but an inconsistent metric (Yang 2005), prone to overestimating the true number of components. The BIC (Bayesian information criterion) (Schwarz 1978) and the CAIC (consistent AIC), accounting for both the number of parameters and the sample size, are consistent metrics. Finally, the MDL (minimum description length) criterion accounts for the number of parameters, the sample size and the number of components. Its core objective differs from the others, as it aims at reducing the amount of code needed to encode both parameters and observations, but it is in practice close to the BIC metric. A thorough description of these scores, with their formulas and theoretical properties, can be found in Fonseca (2008) and Celeux et al. (2018).

In the second category, the most commonly implemented is the ICL (integrated complete-data likelihood), a BIC criterion with an additional entropy penalty (McLachlan and Peel 2000). As opposed to BIC, the entropy term drives the selection towards a well-separated and readable clustering; hence, ICL tends to underestimate the true number of components when they overlap. Similar alternative metrics are the CLC (classification likelihood criterion), AWE (approximate weight of evidence) and NEC (normalised entropy criterion) (Bacci et al. 2012). The several metrics implemented by the reviewed packages are listed in Table 2.

The likelihood-ratio test statistic (LRTS) can also be used to compare nested models, with the additional advantage of deriving a p-value for the probability that a complex model (with more components) should be preferred over a simpler one. Traditionally, the procedure is to add one component after another until the null hypothesis \(H_0\) can no longer be rejected. Under the standard regularity conditions of Cramér's theorem, Wilks' theorem states that the likelihood-ratio statistic asymptotically follows a \(\chi^2\) distribution, but unfortunately these conditions are not met in mixture models (McLachlan and Peel 2000). To circumvent this, bootstrap inference (McLachlan and Peel 2000) is often used to derive an empirical distribution of the likelihood ratio.

Derivation of confidence intervals in GMMs

Point estimation, with a single estimate \(\hat{\theta}\) for a given n-sample, is not enough to evaluate the performance of a specific method, as drawing another n-sample with the same parameters is likely to lead to a different distribution and estimate \(\hat{\theta}\). Instead, it can be interesting to retrieve the distribution, or at least the variability, of the estimated parameters, which can prove useful to derive confidence intervals. However, obtaining the distribution of the parameters, or even an asymptotic approximation of it, is not feasible in practice with mixture models (McLachlan and Peel 2000). Hence, most authors recommend using bootstrap methods to generate confidence intervals, as suggested in Efron and Tibshirani (1993) and Basford et al. (1997).

Bootstrap distributions of the parameters are generally retrieved via empirical or parametric bootstrap, both available in the mclust package. In the empirical, or non-parametric, bootstrap (Jaki et al. 2018), we iteratively draw \(N\) samples of size \(n\) with replacement from the original observed variable \(x_{1:n}\). In the parametric bootstrap, \(N\) samples are simulated from the parameter estimated on the available observations of \(X\), via the EM algorithm or any other estimation method. In both cases, we obtain an empirical distribution of the parameter estimate: \(\hat{\theta}_{1:N}= (\hat{\theta}^1, \ldots, \hat{\theta}^N)\). The sample mean and standard deviation (SD) of this empirical distribution can be used to retrieve an asymptotic estimate of the variability, bias or MSE of the parameter estimate \(\hat{\theta}\). To get unbiased estimates of the true standard deviation and mean of the estimates, it is common practice to compute the empirical covariance matrix of the sample, \(\operatorname{cov} [\hat{\theta}]= \frac{\sum_{j=1}^N (\hat{\theta}^j - \mathbb{E} [\hat{\theta}]) (\hat{\theta}^j - \mathbb{E} [\hat{\theta}])^\top}{N - 1}\), the square roots of its diagonal terms corresponding to the empirical SDs. Symmetric \(1-\alpha\) asymptotic confidence intervals based on the Central Limit Theorem (CLT) can then simply be derived (Equation (15)):

\[\begin{equation} \mathbb{E} [\hat{\theta}_t] \pm \frac{1}{\sqrt{n}}z_{1 - \frac{\alpha}{2}} \sqrt{\operatorname{var}(\hat{\theta}_t)}, \quad \forall t \in \{1, \ldots, 3k\} \tag{15} \end{equation}\]

with \(z_{1 - \frac{\alpha}{2}}\) the \(1 - \frac{\alpha}{2}\) quantile of the standard Gaussian distribution.
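A non-parametric bootstrap sketch of such a CI, using the sample mean of a single Gaussian as the estimator for brevity (for an actual GMM, each resample would be refitted with a full EM run):

```r
# Non-parametric bootstrap of a symmetric confidence interval (spirit of Equation (15)).
set.seed(42)
x <- rnorm(500, mean = 2, sd = 1)   # observed n-sample
N <- 1000                           # number of bootstrap resamples

# Empirical distribution of the estimator: one estimate per resample drawn
# with replacement from the original sample
theta_hat <- replicate(N, mean(sample(x, replace = TRUE)))

# Symmetric CLT-based interval; note that sd(theta_hat) already estimates the
# standard error of the estimator, so no extra 1/sqrt(n) factor is applied here
alpha <- 0.05
ci <- mean(theta_hat) + c(-1, 1) * qnorm(1 - alpha / 2) * sd(theta_hat)
```

The parametric variant differs only in the resampling line: new samples would be simulated from the fitted model rather than drawn from the observed data.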

If computing the covariance matrix is not possible analytically, it can be approximated through the inverse of the expected Fisher information matrix (FIM), \(\mathcal{I}_{\exp}(\theta)\), given by Equation (16):

\[\begin{equation} \left[\mathcal{I}_{\exp}(\theta)\right]_{1\le i \le 3k,1 \le j \le 3k} = - \mathbb{E} \left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \ell(\theta|X)\right] \tag{16} \end{equation}\]

Indeed, the Cramér-Rao theorem states that the variance of any unbiased estimator is lower bounded by the corresponding diagonal element of the inverse of the FIM: \(\operatorname{var}(\hat{\theta}) \ge \frac{1}{\mathcal{I}(\theta)}\). This implies that the ratio between the inverse of the FIM and the variance, \(e(\hat{\theta})=\frac{\mathcal{I}(\hat{\theta})^{-1}}{\operatorname{var}(\hat{\theta})}\), converges to 1, by the asymptotic efficiency of the MLE of GMMs.

Unfortunately, computing the expected FIM remains a hard task. Hence, it is generally replaced by the observed FIM, the negative of the Hessian matrix of the incomplete log-likelihood: \(\mathcal{I}_\text{obs}(\theta)= -\frac{\partial^2}{\partial \theta_i \partial \theta_j} \ell(\theta|X)\). Exact general formulas are provided for the univariate case in Louis (1982) and for the multivariate case in Oakes (1999). Yet, it has to be noted that the expected FIM generally outperforms the observed FIM in estimating the covariance matrix of the MLE (Cao and Spall 2012).

However, all these methods require computing second derivatives of the log-likelihood, which is a disadvantage from a computational point of view. More recently, Meng (2016) and Delattre and Kuhn (2019) proposed an accelerated algorithm requiring only first-order derivatives. A similar alternative is implemented in the mixsmsn package (Prates et al. 2021a), mixsmsn::im.smsn(), in which the Hessian matrix is approximated by the outer product of the gradient of the log-likelihood (Equation (17)):

\[\begin{equation} \mathcal{I}_{\text{obs}}(\theta) \approx \frac{\partial \ell (\theta|X)}{\partial \theta} \left( \frac{\partial \ell (\theta|X)}{\partial \theta} \right)^\top \tag{17} \end{equation}\]

following an idea developed in Basford et al. (1997). For a more general introduction to Gaussian mixtures, including other models and parametrisations in the multivariate case, we refer the reader to the reference paper "Gaussian parsimonious clustering models" by Celeux and Govaert (1992).

An analytic formula of the overlap for univariate Gaussian mixtures

From an analytic point of view, the overlap between the \(k\) components of a variable \(X\) is given by Equation (18):

\[\begin{equation} \operatorname{OVL} (X)= 1 - \int_{\mathbb{R}} \max_j (p_j \varphi_{\zeta_j} (x)) dx \tag{18} \end{equation}\]

The 1 in Equation (18) corresponds to the integral of the probability distribution \(f_{\theta}\) over its domain. The second term is the area under the curve of the weighted component density that is maximal at each point of \(\mathbb{R}\), with \(j\) the index of the maximising component at that point. It should be noted that the definition used here for the overlap is closely related to the definition of the false clustering rate (FCR) (Marandon et al. 2022).

Equation (18) simplifies for a two component mixture distribution to Equation (19):

\[\begin{equation} \operatorname{OVL} (X)= \int_{\mathbb{R}} \min \left(p_1 \varphi_{\zeta_1} (x), p_2 \varphi_{\zeta_2} (x)\right) dx \tag{19} \end{equation}\]

From a probabilistic point of view, we can rewrite Equation (19) as the overall probability of assigning a wrong label to a given observation. With two components, this simply decomposes as the sum of the probability of mistakenly assigning an observation from component 2 to component 1 and the probability of assigning an observation from component 1 to component 2 (Equation (20)):

\[\begin{equation} \begin{split} \operatorname{OVL} (1, 2) & = \operatorname{OVL} (1 | 2) + \operatorname{OVL} (2 | 1) \\ & = \mathbb{P}\left(p_1 \varphi (X, \mu_1, \sigma_1) \le p_2 \varphi (X, \mu_2, \sigma_2) \right) + \mathbb{P}\left(p_2 \varphi (X, \mu_2, \sigma_2) \le p_1 \varphi (X, \mu_1, \sigma_1)\right) \\ & = \int_{\mathbb{R}} p_1 \varphi_{\zeta_1} (x) \mathrm{1}_{p_1 \varphi_{\zeta_1} \le p_2 \varphi_{\zeta_2}} dx + \int_{\mathbb{R}} p_2 \varphi_{\zeta_2} (x) \mathrm{1}_{p_2 \varphi_{\zeta_2} \le p_1 \varphi_{\zeta_1}} dx \end{split} \tag{20} \end{equation}\]
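Equation (19) can be evaluated numerically with stats::integrate(); `overlap_2comp` is an illustrative helper of ours. Note that with this definition, two identical balanced components have an overlap of 0.5:

```r
# Numerical evaluation of Equation (19): the overlap of a two-component
# univariate GMM is the integral of the pointwise minimum of the two
# weighted component densities.
overlap_2comp <- function(p, mu, sigma) {
  f <- function(x) pmin(p[1] * dnorm(x, mu[1], sigma[1]),
                        p[2] * dnorm(x, mu[2], sigma[2]))
  stats::integrate(f, -Inf, Inf)$value
}

overlap_2comp(c(0.5, 0.5), c(0, 0), c(1, 1))  # identical components: 0.5
overlap_2comp(c(0.5, 0.5), c(0, 5), c(1, 1))  # well-separated components: close to 0
```

The same function can be used to cross-check the closed-form expressions of Equations (24) and (25) on any parameter configuration.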

We illustrate the computation of the overlap in some telling cases in Figure 1, showing the relation between the measured overlap, the level of entropy of the mixing proportions and the individual standard deviations. The means of components 1 and 2 are 5.28 and 8.45. Panels A and C correspond to balanced classes, while in panels B and D, class 1 is more abundant, with a frequency of 0.9. Finally, in panels A and B, the variance of component 1 is smaller than the variance of component 2, with respective SDs of 1 and 3, and reciprocally for panels C and D. Interestingly, in panel D, using the MAP rule based on the posterior probabilities defined in Equation (21), all observations drawn from class 2 are wrongly assigned to class 1.

\[\begin{equation} \eta_{i} (j) := \mathbb{P}_{\theta} (S_i=j |X_i=x_i) \tag{21} \end{equation}\]

The red area corresponds to the probability of misclassifying component 1 as component 2, while the green area corresponds to the probability of misclassifying component 2 as component 1. The total overlap is thus the sum of the red and green areas in Figure 1.


Figure 1: Illustration of the overlap between the components of a two-component GMM. The density function of component 1 is given by the red line, that of component 2 by the green line, and the total density function \(p_\theta(X)\) is represented in blue. The total overlap is given by the sum of the green and red areas.

There are two intersection points, \(x_1\) and \(x_2\) (assuming \(\mu_1 < \mu_2\)), when solving Equation (22):

\[\begin{equation} p_1 \varphi(x, \mu_1, \sigma_1) = p_2 \varphi(x, \mu_2, \sigma_2) \tag{22} \end{equation}\]

in the following cases: if \(\sigma_2 > \sigma_1\), we must have \(p_1 > \frac{\sigma_1}{\sigma_1 + \sigma_2}\); if instead \(\sigma_2 < \sigma_1\), then \(p_1 < \frac{\sigma_1}{\sigma_1 + \sigma_2}\). In these cases, the intersection points are given by Equation (23):

\[\begin{equation} \begin{split} (x_1, x_2) = \left(\frac{\sigma_1^2 \mu_2 - \sigma_2^2 \mu_1 \pm \sigma_1 \sigma_2 \sqrt{(\mu_1 - \mu_2)^2 + 2(\sigma_2^2 - \sigma_1^2)\left[ \log \left(\frac{p_1}{p_2}\right) + \log \left(\frac{\sigma_2}{\sigma_1}\right)\right]}}{\sigma_1^2 - \sigma_2^2}\right) \end{split} \tag{23} \end{equation}\]

Again, the sign of the discriminant and the order of the roots yield two distinct cases, depending on whether \(\sigma_1\) is greater than \(\sigma_2\) or not. Both situations with unbalanced classes are illustrated in panels B and D of Figure 1:

\[\begin{equation} \begin{split} \operatorname{OVL} (1, 2) = p_1 \left(\Phi(x_2, \mu_1, \sigma_1) +1 - \Phi(x_1, \mu_1, \sigma_1) \right) + p_2 \left( \Phi(x_1, \mu_2, \sigma_2) -\Phi(x_2, \mu_2, \sigma_2) \right) \end{split} \tag{24} \end{equation}\] \[\begin{equation} \begin{split} \operatorname{OVL} (1, 2) = p_2 \left(\Phi(x_1, \mu_2, \sigma_2) +1 - \Phi(x_2, \mu_2, \sigma_2) \right) + p_1 \left( \Phi(x_2, \mu_1, \sigma_1) -\Phi(x_1, \mu_1, \sigma_1) \right) \end{split} \tag{25} \end{equation}\]

An interesting result is obtained under the homoscedasticity and balanced-classes assumptions of the k-means algorithm. There is only one intersection point in that case, \(x_c= \frac{\mu_1 + \mu_2}{2}\), which is simply the midpoint of the segment bounded by the means of the two components. With the definition of Equation (19), since each of the two weighted error terms contributes \(\frac{1}{2} \Phi \left(- \frac{|\mu_1 - \mu_2|}{2\sigma}\right)\), the overlap then reduces to \(\operatorname{OVL} (1, 2) = \Phi \left(- \frac{|\mu_1 - \mu_2|}{2\sigma}\right)\).

To our knowledge, no closed formula has been determined for the overlap generalised to more than two components (a combinatorial set of inequalities to solve), nor in the unconstrained multivariate setting. Indeed, even restricting the study to the bivariate setting (the computation of the OVL then amounts to estimating the intersection zone between two ellipses), the exact computation of the OVL involves multiple integration and the algebraic resolution of a quartic equation. A first step is provided by Alberich-Carramiñana et al. (2017), who state algebraic conditions for the existence of an intersection region and compute, where applicable, a closed formula for the OVL between two coplanar ellipses.

Accordingly, only stochastic approximations relying on randomised algorithms are available so far, such as Monte-Carlo integration with a rejection technique: knowing that the total area under the curve is normalised to one, we randomly simulate observations, and the ratio of observations falling in the intersection area is then used as a proxy of the overlap (Maitra and Melnykov 2010; Nowakowska et al. 2014; Pastore and Calcagnì 2019).
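A sketch of such a rejection-based Monte-Carlo scheme for a two-component univariate mixture; the bounding box below is our own choice, not taken from the cited papers:

```r
# Monte-Carlo approximation of the overlap (Equation (19)) by rejection:
# draw points uniformly in a bounding box and count those lying under
# *both* weighted component densities.
set.seed(1)
p <- c(0.5, 0.5); mu <- c(0, 2); sigma <- c(1, 1)

a <- min(mu - 5 * sigma); b <- max(mu + 5 * sigma)  # generous support bounds
M <- max(p / (sqrt(2 * pi) * sigma))                # bound on the weighted densities
n <- 1e6

x <- runif(n, a, b)
y <- runif(n, 0, M)
hit <- y < pmin(p[1] * dnorm(x, mu[1], sigma[1]),
                p[2] * dnorm(x, mu[2], sigma[2]))

ovl_mc <- mean(hit) * (b - a) * M   # proportion of hits times box area
```

The estimate converges at the usual \(O(1/\sqrt{n})\) Monte-Carlo rate, and, unlike the closed formulas above, the scheme extends directly to more components and to the multivariate setting.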

2 Appendix B: Extensions of the EM algorithm to overcome its limitations

Two main alternatives were developed in parallel to the EM algorithm and are implemented in some of the reviewed packages: the CEM and SEM algorithms. However, they do not share its theoretical properties, in particular the guarantee of consistency.

The M-step of the classification EM (CEM) algorithm (Biernacki et al. 2000) maximises a function in which each observation is assigned to its cluster by the maximum a posteriori (MAP) rule based on Equation (21). It generalises the well-known k-means algorithm, making no assumption of homoscedasticity or balanced clusters. Its main drawback is that it does not account for the uncertainty of the cluster assignment, which makes the algorithm inconsistent (McLachlan and Peel 2000). EM*, described in Kurban et al. (2017) and implemented in the DCEM package, is a faster implementation of the CEM algorithm, with roughly half the complexity. To do so, only the posterior distributions associated with the half of the observations whose assignment is most uncertain are re-computed in the E-step; this avoids recomputing quantities for observations that are unlikely to change cluster from one iteration to the next. However, the higher speed of this algorithm has not been theoretically proven, as the gain in running time per iteration may be offset by a greater number of steps needed to reach convergence.

The Stochastic EM (SEM) algorithm replaces the MAP value of \(S\) in the E-step of the CEM algorithm by a random draw (or \(N\) of them in the \(N\)-variant of the algorithm) from the posterior distribution \(\mathbb{P}_\theta (S|X)\). As this algorithm does not converge to a unique solution, but rather oscillates around a local maximum, the estimate is usually obtained by averaging the late values while discarding the first estimates from the burn-in phase. A theoretical description of these algorithms, with a discussion of their convergence properties, is given in Celeux and Govaert (1992). The SEM algorithm also converges relatively faster than the EM algorithm, but it is more prone to be trapped in a local maximum or to discard a component. Increasing the number of draws \(N\) may alleviate this issue, but at the expense of computational performance.
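The stochastic E-step can be sketched as follows, drawing each observation's label from its posterior distribution; the helper name is ours, not a package API:

```r
# Stochastic E-step of the SEM algorithm: each label is a random draw from
# the observation's posterior distribution over the k components, instead of
# the MAP label used by CEM.
sem_draw <- function(eta) {   # eta: n x k matrix of posterior probabilities
  apply(eta, 1, function(pr) sample.int(length(pr), 1, prob = pr))
}
set.seed(1)
eta <- matrix(c(0.9, 0.1,     # observation 1: almost surely component 1
                0.5, 0.5),    # observation 2: maximal uncertainty
              nrow = 2, byrow = TRUE)
sem_draw(eta)                 # one random label per observation
```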

A wide variety of fast algorithms derived from the EM algorithm have been developed. cwEM (the component-wise EM algorithm), described in Celeux et al. (2012), is a variation of the EM algorithm aiming at speeding up its convergence. At each iteration, the M-step is performed for only one of the components \(\theta_j=(p_j, \mu_j, \sigma_j)\), implying that the parameters of a given component are estimated sequentially rather than simultaneously. The underlying theory relies on a Gauss-Seidel scheme and was first used in the SAGE algorithm. However, the constraints on the proportions set in Equation (2) are only guaranteed if the algorithm converges. Additionally, faster convergence is not theoretically proven in every situation. A list of general acceleration methods for the EM algorithm, not specific to GMMs, is available in turboEM (Bobb and Varadhan 2021).

Other EM-inspired algorithms focus on counterbalancing the main limitations of the EM algorithm. The Variational Bayesian EM (VBEM) algorithm performs a Bayesian estimation of the parameters. Indeed, the large space of all possible parameter estimates \(\Theta\) can be hard to explore, and the usual initialisation methods are uninformative, taking no account of expert recommendations. VBEM uses these prior assumptions on the parameters' distribution \(\mathbb{P} (\theta)\) to optimise the posterior distribution \(\mathbb{P} (\theta|X)\), based on Bayes' rule. The direct determination of the Bayesian posterior law of the parameters is generally an intractable problem, hence Variational Bayes only maximises an approximation of the true posterior, assuming that the parameters can be partitioned into independent distributions. This hypothesis is known as the mean-field approximation (Murphy 2012).

The minimum message length (MML) EM algorithm, implemented in the GMKMcharlie package, is a completely unsupervised algorithm, as it does not require any prior selection of the number of components (Figueiredo and Jain 2002), dealing explicitly with the possibility of discarding a component during the iterations. To do so, the selection criterion for the number of components is directly included in the optimisation procedure. However, its implementation is close to a Bayesian estimation of the parameters of the model, setting a non-informative Dirichlet prior distribution on the ratios, and the expected higher performance of the algorithm has not been demonstrated on real use cases (Figueiredo and Jain 2002).

The Expectation/Conditional Maximisation (ECM) algorithm (Meng and Rubin 1993) belongs to the super-family of GEM (general EM) algorithms, generally used when the maximisation of the auxiliary function has no closed-form solution. To do so, the ECM algorithm replaces the intractable M-step of the EM algorithm by a number of computationally simpler conditional maximisation (CM) steps: instead of inferring all parameters at once, each conditional step retrieves a set of optimal parameters, conditioned on the current value of the others. ECM is for instance used with GMMs including an additional linear constraint on the means of the components, as provided by the mixtools package. As documented in Table 2, EMMIXmfa implements an extension of the ECM, termed the alternating expectation-conditional maximisation (AECM) algorithm (Meng and Van Dyk 1997), which can be used to reduce the computational burden of estimating the parameters of mixtures of factor analysers. The AECM algorithm extends the ECM algorithm by allowing the complete data used for estimation to differ at each CM-step (generally, in order to speed up the computation, by selecting a subset of the most leveraged observations). GEM algorithms share the same asymptotic theoretical properties as the EM algorithm, in particular the local consistency of the returned estimates.

A small simulation to evaluate the impact of outliers

Classical methods used for parameter estimation, especially maximum likelihood estimation (MLE), are sensitive to the presence of outliers. A naive solution consists in assigning null weights to observations suspected to be outliers, so that they do not contribute to the estimation8. Trimming aberrant observations from the distribution is justified theoretically by the principle of the spurious outlier model (Gallegos and Ritter 2005). However, this method is quite stringent, requiring human expertise or the use of general outlier detection tools not necessarily adapted to GMM estimation.

Two general approaches for dealing with outliers with a well-defined theoretical background are outlier mixture modelling and the trimming approach. Outlier mixture modelling integrates an additional component accounting for the outliers in the distribution. Notably, the mclust (Fraley et al. 2022) and otrimle (Coretto and Hennig 2021) packages use an improper uniform distribution to model the distribution of outliers. Unlike mclust, the otrimle package does not require the user to set in advance the proportion of outliers in the mixture (Coretto and Hennig 2016). By contrast, in the trimming approach, outliers are first removed before the complete estimation of the parameters. Such methods are implemented in the tclust (Iscar et al. 2022) and oclust (Clark and McNicholas 2019) packages.

tclust (Iscar et al. 2022) uses a robust constrained clustering method, in which the user has to set an upper threshold on the ratio between the highest and the lowest variability among all components, and a trimming ratio \(\alpha\). It extends the work of García-Escudero et al. (2008), with relaxed constraints on the Gaussian distribution. First, the maximal degree of affinity, defined in Equation (26):

\[\begin{equation} D(x_i|\theta)=\max_j \left(p_{j} \varphi_{\zeta_j} (x_i) \right) \tag{26} \end{equation}\]

is computed for each observation \(x_i\), and corresponds for each point to the maximum probability of observing it in the distribution, given parameter \(\theta\). Then, the fraction \(\alpha\) of observations the least likely to be observed is trimmed for the estimation of the parameters. The iterative algorithm stops when the estimated parameters have converged and the set of identified outliers no longer changes from one iteration to the next. The use of constraints is an additional feature that avoids building over-dispersed or unbalanced clusters, the tightest constraint of a ratio of 1 yielding clusters with equal sizes. However, the identification of an observation as aberrant is highly dependent on the variability constraint, and the determination of these two hyperparameters is complex and highly dependent on the shape of the distribution. Additionally, a CEM algorithm is used to retrieve the parameters and the proportion of outliers, whose MLE, in contrast with the EM algorithm, is neither asymptotically consistent nor efficient.
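The affinity-based trimming step can be sketched as follows; this is an illustrative univariate helper of our own, not the tclust implementation, and the data below are illustrative assumptions:

```r
# alpha-trimming based on the maximal degree of affinity D(x_i | theta) of
# Equation (26): the fraction alpha of observations least likely under the
# current parameters is discarded before re-estimating the parameters.
trim_observations <- function(x, p, mu, sigma, alpha) {
  dens <- sapply(seq_along(p), function(j) p[j] * dnorm(x, mu[j], sigma[j]))
  D <- apply(dens, 1, max)        # maximal degree of affinity per observation
  x[D > quantile(D, alpha)]       # keep the (1 - alpha) most likely points
}
set.seed(1)
x <- c(rnorm(98), 100)            # one gross outlier at 100
trimmed <- trim_observations(x, p = c(0.5, 0.5), mu = c(0, 1),
                             sigma = c(1, 1), alpha = 0.01)
```

The outlier at 100 has an affinity of essentially zero under both components, so it falls below the \(\alpha\)-quantile and is removed.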

Unlike tclust, oclust (Clark and McNicholas 2019) both retrieves the proportion of outliers and identifies them. To do so, it compares the complete log-likelihood of the mixture \(\ell(\theta|X)\) with its value when one observation is removed, \(\ell(\theta | X \setminus X_i)\), for all observations. Observations are iteratively removed, based on the assumption that the Kullback-Leibler divergence between the original log-likelihood and the trimmed log-likelihood, \(\text{KL}\left(\ell(\theta|X)|| \ell(\theta | X \setminus X_i)\right)\), follows a Beta distribution. At each step, the observation that maximises the Kullback-Leibler divergence at a statistically significant threshold is removed. The algorithm stops trimming outliers when this measure is no longer statistically significant. However, the assumption of a Beta distribution only holds asymptotically and with non-overlapping clusters.

To evaluate the impact of outliers on the estimation, we simulated a two-component GMM with well-separated and balanced clusters. The outlier distribution, corresponding to the additional noise component, was obtained by randomly selecting prop.outliers points out of the total number of observations and drawing their values from a uniform distribution bounded by an interval five times as big as the 0.05 and 0.95 quantiles of \(f_\theta(X)\). All estimates were obtained comparing the five reviewed initialisation methods, except with otrimle, which has its own hierarchical clustering initialisation method.
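Under one reading of this contamination scheme, the simulated sample can be sketched as follows; the component parameters and the interval construction are illustrative assumptions, not the exact benchmark settings:

```r
# Contaminated two-component GMM: a fraction prop.outliers of the points is
# replaced by uniform draws over an interval five times as wide as the range
# between the 0.05 and 0.95 quantiles of the mixture, centred on that range.
set.seed(1)
n <- 2000; prop.outliers <- 0.02
z <- sample(1:2, n, replace = TRUE)          # two balanced components
x <- rnorm(n, mean = c(0, 10)[z], sd = 1)    # well-separated clusters
q <- quantile(x, c(0.05, 0.95))              # quantiles of f_theta(X)
centre <- mean(q); width <- 5 * diff(q)      # interval five times as wide
idx <- sample(n, round(prop.outliers * n))   # points replaced by outliers
x[idx] <- runif(length(idx), centre - width / 2, centre + width / 2)
```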


Figure 2: A) Execution times for the nine reviewed packages using hierarchical clustering initialisation, with on the left \(2\%\) of outliers in proportion and on the right, \(4\%\) of outliers. B) and C) Boxplots of the estimated parameters with \(N=200\) repetitions, \(n=2000\) observations and respectively \(2\%\) and \(4\%\) of outliers. The red dashed horizontal line corresponds to the true value of the parameters.

The slowest package is otrimle, with most of the time taken by the initialisation step, in which the proportion and the identification of the outliers are determined. The running times of the other packages are generally not impacted by the presence of outliers.

Most of the reviewed packages, except the bgmm package, are not impacted by the choice of initialisation method. Additionally, the proportions are rather correctly estimated (owing to the choice of a uniform distribution to model the outliers), but the reviewed packages tend to overestimate the true variability of each component, with the worst results obtained with the rebmix initialisation. bgmm sets itself apart from the others by its reduced bias on the estimated means and standard deviations, a feature left undocumented. However, increasing the number of outliers (Figure 2, panel C) also leads to biased estimates for bgmm, while otrimle, a dedicated package, is still able to correctly estimate the individual parameters of the components' distributions with a high proportion of outliers. Yet, analysing the code of bgmm reveals that there is no dedicated feature to remove outliers, but rather a specific method used to deal with numerical underflow that artificially increases the probability of observing outlying distributions (see EM-implementation differences across reviewed packages).

3 Appendix C: the meta-analysis workflow for the final selection of CRAN and Bioconductor platforms

General workflow

Table 4 lists the terms used in the search, the number of packages returned by the search, the number of packages excluded from review after the search, and the names of the packages ultimately selected for review. The CRAN and Bioconductor platforms are the two most popular repositories for R packages, with a demanding review process before publication.

Most packages we excluded from review did not focus on the GMM model, but instead supplied tools for visualising and assessing the quality of a given clustering. For instance, the search term "cluster" returned many packages implementing other unsupervised clustering methods, such as k-means, KNN or graph clustering, or dedicated to specific data, such as single-cell analyses. The search term "mixture" returned either packages dealing only with non-Gaussian components, such as fitmix with log-normal distributions, or packages dedicated to chemical mixture designs.

Table 4: Meta-analysis summary about the selection of packages implementing the estimation of GMMs, on CRAN and Bioconductor.

| Platforms | Searched terms | Number of returned packages | Number of packages implementing GMMs | Packages implementing GMMs | Packages kept |
|---|---|---|---|---|---|
| Bioconductor | mixture | 15 | 3 | epigenomix, fmrs, semisup | |
| Bioconductor | cluster | 69 | 1 | Melissa | |
| CRAN | mixture | 179 | 44 | AdaptGauss, bgmm, bmixture, bpgmm, CAMAN, ClusterR, deepgmm, DPP, dppmix, EMCluster, EMMIXgene, EMMIXmfa, fabMix, flexmix, fmerPack, GMKMcharlie, IMIX, ManlyMix, mclust, MGMM, mixAK, MixAll, mixdist, mixR, mixreg, mixsmsn, mixtools, mixture, MixtureInf, MMDvariance, nor1mix, pcensmix, pgmm, pmclust, polySegratioMM, rebmix, Rmixmod, RMixtComp, RobMixReg, RPMM, SAGMM, sensory, SMNCensReg | bgmm, EMCluster, flexmix, GMKMcharlie, mclust, mixtools, Rmixmod |
| CRAN | cluster | 418 | 16 | ClusterR, clustMD, DCEM, EMCluster, HDclassif, ManlyMix, mclust, mixAK, MixAll, mixture, oclust, otrimle, pmclust, rebmix, Rmixmod, tclust | EMCluster, mclust, Rmixmod |

At this stage, too many packages remained for a tractable benchmark. We hence performed a stricter selection, based on the following criteria:

We then removed the packages in which the MLE estimation of the unconstrained GMM model was an ancillary task:

The popularity of the selected packages varies widely, as shown in Figure 3. Among them, mclust and flexmix are by far the most popular, followed by mixtools and the increasingly popular Rmixmod package.

Figure 3: Number of daily downloads (logarithmic scale) from the CRAN mirror from the 1st of January to the 30th of April 2022, for the seven R packages reviewed.

Only the packages dedicated to high dimensionality, listed in our first bullet point, are relevant for benchmarking performance as a function of the number of dimensions. Indeed, although some packages computing mixtures of regressions do implement features to handle high-dimensional datasets, such as RobMixReg and fmerPack, they all assume a diagonal covariance structure, and accordingly independent covariates.

The two existing strategies are then limited to projecting onto a smaller subspace, usually within the theoretical framework of factor analysis, or to performing a feature selection strategy. We quickly discarded fabMix, since it only retrieves the parameters of GMMs within a Bayesian framework, while we focused on strategies retrieving the MLE via the EM algorithm. The core function pgmm::pgmmEM of the pgmm package unfortunately includes a seed for the algorithm's initialisation that cannot be disabled. Such a feature is generally not recommended for reproducibility: by defining the seed internally in the function, we were not able to independently generate new reproducible datasets in our benchmark (instead, it is recommended to set the seed value once and for all at the beginning of the virtual simulation). Additionally, while implementing the same AECM variant of the EM algorithm as EMMIXmfa, as detailed in Appendix B: Extensions of the EM algorithm to overcome its limitations, its convergence criterion differs from the other benchmarked packages. Indeed, instead of considering a maximal number of iterations along with a prior threshold, either absolute or relative, it examines only the difference between the current value of the log-likelihood and a corresponding asymptotic estimate, based on the Aitken acceleration (Lindsay 1995). In brief, the asymptotic value of the log-likelihood is the limiting sum of a geometric series whose common ratio, the so-called Aitken acceleration, is the relative fraction of the log-likelihood gain of the current iteration. Therefore, the use of a different termination criterion precludes any further fair comparison with the other benchmarked packages, as there is no direct equivalence between the two methods.
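Under our reading of this criterion, the asymptotic estimate can be sketched as follows (the helper name is ours):

```r
# Aitken-based asymptotic estimate of the log-likelihood: the limiting sum of
# a geometric series whose common ratio a_q (the Aitken acceleration) is
# estimated from three consecutive log-likelihood values.
aitken_limit <- function(ll) {              # ll: three consecutive values
  a <- (ll[3] - ll[2]) / (ll[2] - ll[1])    # Aitken acceleration
  ll[2] + (ll[3] - ll[2]) / (1 - a)         # asymptotic estimate
}
# A geometric log-likelihood sequence with ratio 0.5 converging to -100:
aitken_limit(c(-104, -102, -101))           # -> -100
```

The stopping rule then compares \(\ell(\hat{\theta}_q|X)\) with this asymptotic estimate rather than with the previous iterate.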

Finally, clustvarsel is not really tailored for datasets with a large number of dimensions, but rather for datasets with a small number of observations. Indeed, by performing a sequential search of the model space in a forward-backward process (namely, adding variables to the null model until the full model, with all features, is recovered), the algorithm requires intensive computational resources (for instance, there are already \(2^{10}=1024\) models to test in dimension 10). Instead of a greedy strategy, it would have been interesting to implement an independent and highly parallelisable feature selection process by sampling the model space. To do so, Chen and Chen (2008) and He and Chen (2016) suggest a stochastic feature selection strategy, notably using the eBIC criterion in order to have an equal chance of drawing a model of any dimension9. Such a strategy is commonly used in ensemble learning.

Practical details for the implementation of our benchmark

First, the number of observations (\(n=200\) and \(n=500\) respectively in the univariate and bivariate settings) was chosen high enough both to lower the probability of generating a sample without drawing any observation from one of the components in case of highly unbalanced clusters, and to decrease the margin of error related to random sampling. Specifically, the probability of generating at least one simulation, among the \(N\) generated, for which fewer than two observations proceed from component \(j\) (the minimal number of elements required to estimate both the mean and the variance of the corresponding cluster), with a two-component mixture of \(n\) observations, is given by the following formula (Equation (27)):

\[\begin{equation} 1 - \left(1 - (1-p_j)^n - n \times (1-p_j)^{n-1}\times p_j\right)^N \tag{27} \end{equation}\]
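This probability can be computed directly; the following sketch (the helper name is ours) reproduces the figures quoted in the next paragraph:

```r
# Probability (Equation (27)) of generating, among N repetitions, at least one
# sample in which fewer than 2 observations are drawn from component j of a
# two-component mixture of n observations.
p_degenerate <- function(N, n, p_j) {
  q <- (1 - p_j)^n + n * (1 - p_j)^(n - 1) * p_j   # one degenerate sample
  1 - (1 - q)^N                                    # at least one among N
}
p_degenerate(N = 100, n = 200, p_j = 0.10)   # ~1.6e-06
p_degenerate(N = 100, n = 200, p_j = 0.05)   # ~0.04
```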

Interestingly, the probability of generating such a sample among the \(N\) repetitions increases exponentially with the level of imbalance. For instance, considering \(N=100\) repetitions, \(n=200\) observations per sample and a proportion of the minor component \(p_j=0.1\), the probability of generating a degenerate simulation is insignificant (\(1.63 \times 10^{-6}\)), while, keeping the same general simulation parameters and setting the minor proportion to \(p_j=0.05\), the risk increases considerably, to a probability of \(0.04\). We focused on one of the impacts of high dimensionality, namely the homogenisation and convergence of any distance norm and the increase in sparsity as features are added. We deliberately did not consider the case where the number of dimensions exceeds the number of observations (namely, when \(D>n\)), since in this configuration the covariance matrix is no longer full-rank and invertible, implying that the corresponding probability distribution spans only a smaller subspace. However, with so few observations (\(n=200\) in scenarios identified as a), we reveal the impact on the quality of the estimation when the number of observations is close to the number of free parameters required to parametrise the full GMM model (with \(k=2\) clusters and \(D=10\) dimensions, \(k \times \frac{D(D+1)}{2} + kD + 1 = 131\) parameters are needed).

Unless stated explicitly, we keep the default hyperparameters and custom global options provided by each package. For instance, the flexmix package has an option, minprior, set by default to 0.05, which removes any component of the mixture whose ratio falls below \(0.05\). Besides, we only implement the fully unconstrained model in both the univariate and multivariate settings, as it is the only parametrisation implemented in all seven packages and the most popular for classic GMM clustering, since no restrictive and difficult-to-test assumptions are required.

Additionally, as stated in Parameters estimation in a high-dimensional context, the intrinsic dimension \(d_j\) of each cluster \(j\) is a hyperparameter, which is generally inferred independently from the GMM estimation itself. While a variety of methods from the field of factor analysis, enumerated in Factor criteria selection, have been developed to estimate the intrinsic dimension, to our knowledge only two of them have been implemented in CRAN packages: Cattell's scree test (Cattell 1966) and the dimension selection graph using one of the penalty metrics discussed in the appendix Model selection (Bergé et al. 2012). However, while HDclassif natively implements a performance criterion for determining the dimension of the spanning space, performed under the hood by function NA, none of the other evaluated packages implemented a dimension selection feature. Instead, we infer it for each of the packages dedicated to high dimensionality with HDclassif, using the so-called model "AkjBkQkD", for which the intrinsic dimension is common to all components but the characteristics are unique to each component. Finally, among all supplied parametrisations, we use the least constrained one. Namely, we used the model "AkjBkQkDk" with HDclassif, in which not only the individual features of the covariance matrix but also the spanning dimension are unique to each cluster, and the function EMMIXmfa::mcfa() of the EMMIXmfa package, in which the transition matrix is common to all components (referred to as the orientation matrix in Appendix Parameters estimation in a high-dimensional context).

While all seven reviewed packages accept initial estimates provided by the user, both the input and the output formats differ between them, requiring intensive processing to standardise both the initial estimates and the output estimates. Notably, a well-known issue with mixture models is that they are identifiable only up to a permutation of the components (in other words, permuting the labels does not change the likelihood of the model). Assigning one component of the mixture to a specific index is generally immaterial, as the main objective is to return the estimates. However, when comparing the estimated parameters with the true values, we must associate each component unequivocally with a specific index. To do so, we set a partial ordering, sorting the components by increasing order of their means. Indeed, while the ratio or the covariance estimates can be equal across components, this is generally not the case for the centroids, as that would result in a degenerate distribution. The consequences and some illustrations of the non-identifiability of mixture distributions are discussed in section Identifiability of finite mixture models, in Dai and Mukherjea (2001) and in Robert and Casella (2010).
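This partial ordering can be sketched as a small helper of our own (hypothetical, not part of any reviewed package):

```r
# Relabel the components of a univariate GMM by increasing mean, so that each
# estimated component can be matched unambiguously with the true one.
relabel_components <- function(p, mu, sigma) {
  ord <- order(mu)
  list(p = p[ord], mu = mu[ord], sigma = sigma[ord])
}
# The component centred at 0 becomes component 1, regardless of input order:
relabel_components(p = c(0.7, 0.3), mu = c(5, 0), sigma = c(2, 1))
```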

We detail below some additional functions we implement to both homogenise input and output of the packages and ease the user’s task when comparing the performance of these packages:

On the contrary, none of the evaluated packages dedicated to high-dimensional datasets allows the user to provide their own estimates. Thus, when any of the benchmarked initialisation methods listed in Table 1 was internally available in the package, we used it with the same hyperparameters described in the main paper, section Initialisation of the EM algorithm. If not, we provided instead a vector containing the MAP assignments inferred by the native initialisation method, in a process similar to that used with hierarchical clustering.

In addition to the plots displaying the bootstrapped parameter estimates associated with the scenarios in Tables 5, 10 and 15, we computed summary statistics to compare the performances of the reviewed packages:

The main differences across packages, as well as the performance results obtained in each univariate, bivariate and high-dimensional simulation scenario, are thoroughly described in the next section.

4 Appendix D: comprehensive report from the univariate and multivariate benchmark

EM-implementation differences across reviewed packages

Most of the distinct behaviours between the packages result from additional choices external to the EM algorithm itself, aiming at partly overcoming its main limitations (Panel B, Figure 9). We detail their differences below, ranked in decreasing order of their leverage on the final estimate:

  1. Most of the differences between the two classes of packages (Figure (2)) are related to the choice of either a relative or an absolute termination criterion for the EM algorithm. Given a user-defined threshold, the absolute method stops the estimation early by comparing against it the difference between two consecutive log-likelihoods, \(|\ell(\hat{\theta}_{q}|X) - \ell(\hat{\theta}_{q-1}|X)|\), while the relative method examines the variation rate, \(\left\lvert\frac{\ell(\hat{\theta}_{q}|X) - \ell(\hat{\theta}_{q-1}|X)}{\ell(\hat{\theta}_{q-1}|X)}\right\rvert\).
  2. Several methods can be used to deal with numerical underflow, which mostly happens with highly unlikely observations, distant from any centroid.
    • The least elaborate approach is that of Rmixmod, which returns an error when any of the posterior probabilities or any of the estimated parameters falls below the machine precision threshold (\(2.22 \times 10^{-16}\) on most operating systems).

    • If the maximal value of the posterior probabilities is null, bgmm subtracts the minimal log posterior probability from every log-computed probability. This method avoids numerical underflow by preventing the computation of null ratios, but the correctness of the estimates is no longer enforced10.

    • The remaining packages handle numerical underflow in a more convincing manner, as they guarantee returning the MLE estimate. The flexmix, GMKMcharlie and EMCluster packages use the same log-rescaling trick detailed in Application of the EM algorithm to GMMs. The mixtools and mclust packages use a variant of this trick, taking advantage of the factorisation by the greatest element (Equation (28); Equation 3, p. 5 of Benaglia et al. (2009)), but without exploiting the Taylor expansion of \(\log(1+x)\):

\[\begin{equation} \eta_{i} (j) = \frac{p_{j} \, \varphi_{\zeta_j} (x)}{\sum_{j=1}^k p_{j} \, \varphi_{\zeta_j} (x)}=\frac{\frac{p_{j} \, \varphi_{\zeta_j} (x)}{p_{j_{\min}} \, \varphi_{\zeta_{j_{\min}}} (x)}}{1+ \frac{\sum_{j\neq j_{\min}} p_{j} \, \varphi_{\zeta_j} (x)}{p_{j_{\min}} \, \varphi_{\zeta_{j_{\min}}} (x)}} \tag{28} \end{equation}\]

In both cases, the explicit computation of the smallest posterior probability, the most prone to be assigned a null value, is bypassed, avoiding inconsistent ratios of the type \(0/0\).
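For reference, the log-rescaling family of tricks amounts to the classical log-sum-exp computation of the posterior probabilities; a sketch for the univariate case (an illustrative helper of our own, not any package's code):

```r
# Log-sum-exp computation of the posterior probabilities eta_i(j), avoiding
# numerical underflow for observations distant from every centroid.
posterior_probs <- function(x, p, mu, sigma) {
  # log(p_j * phi_j(x_i)) for each observation x_i (rows) and component j
  logdens <- sapply(seq_along(p),
                    function(j) log(p[j]) + dnorm(x, mu[j], sigma[j], log = TRUE))
  m <- apply(logdens, 1, max)   # subtract the row maximum before exponentiating
  eta <- exp(logdens - m)       # the largest term becomes exp(0) = 1, never 0
  eta / rowSums(eta)
}
eta <- posterior_probs(c(-50, 0, 50), p = c(0.5, 0.5),
                       mu = c(0, 4), sigma = c(1, 1))
rowSums(eta)   # each row sums to 1, even for the two extreme observations
```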

We enumerate below some additional features supplied by the packages:

Supplementary Figures and Tables in the univariate simulation

Table 5 below lists the complete set of parameters used to simulate the univariate Gaussian mixture distributions in our benchmark:

Table 5: The 9 parameter configurations tested to generate the samples of the univariate experiment, with \(k=4\) components.
ID Entropy OVL Proportions Means Standard deviations
U1 1.00 3.3e-05 0.25 / 0.25 / 0.25 / 0.25 0 / 4 / 8 / 12 0.3 / 0.3 / 0.3 / 0.3
U2 1.00 5.7e-03 0.25 / 0.25 / 0.25 / 0.25 0 / 4 / 8 / 12 1 / 1 / 1 / 1
U3 1.00 2.0e-02 0.25 / 0.25 / 0.25 / 0.25 0 / 4 / 8 / 12 2 / 2 / 2 / 2
U4 0.96 3.3e-05 0.2 / 0.4 / 0.2 / 0.2 0 / 4 / 8 / 12 0.3 / 0.3 / 0.3 / 0.3
U5 0.96 5.8e-03 0.2 / 0.4 / 0.2 / 0.2 0 / 4 / 8 / 12 1 / 1 / 1 / 1
U6 0.96 2.0e-02 0.2 / 0.4 / 0.2 / 0.2 0 / 4 / 8 / 12 2 / 2 / 2 / 2
U7 0.68 2.7e-05 0.1 / 0.7 / 0.1 / 0.1 0 / 4 / 8 / 12 0.3 / 0.3 / 0.3 / 0.3
U8 0.68 4.4e-03 0.1 / 0.7 / 0.1 / 0.1 0 / 4 / 8 / 12 1 / 1 / 1 / 1
U9 0.68 1.5e-02 0.1 / 0.7 / 0.1 / 0.1 0 / 4 / 8 / 12 2 / 2 / 2 / 2

Figures 4 to 8 each summarise the benchmarking results associated with one of the scenarios listed in Table 5.

Summary Tables 6 to 9 display the average performance of each package in the benchmark with each initialisation method. The best performing pair (lowest bias or MSE) is highlighted in green, and the worst performing in red. The MSE and bias columns were derived by summing, respectively, over the estimated proportions, means and standard deviations of the individual components.

Table 6: MSE and Bias associated to scenario U1, in Table 5 (balanced and well-separated components)

| Package | Initialisation Method | Global MSE p | Global MSE μ | Global MSE σ | Global Bias p | Global Bias μ | Global Bias σ |
|---|---|---|---|---|---|---|---|
| EMCluster / GMKMcharlie | hc | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | kmeans | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | quantiles | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | random | 0.0061 | 1.90000 | 0.19000 | 0.0290 | 0.5300 | 0.1100 |
| | rebmix | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| flexmix | hc | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | kmeans | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | quantiles | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | random | 0.0064 | 1.80000 | 0.19000 | 0.0290 | 0.5300 | 0.1100 |
| | rebmix | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| mclust / bgmm | hc | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | kmeans | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | quantiles | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | random | 0.0062 | 1.80000 | 0.19000 | 0.0290 | 0.5300 | 0.1100 |
| | rebmix | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| mixtools | hc | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | kmeans | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | quantiles | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | random | 0.0064 | 1.90000 | 0.19000 | 0.0290 | 0.5300 | 0.1100 |
| | rebmix | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| Rmixmod / RGMMBench | hc | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | kmeans | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | quantiles | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |
| | random | 0.0064 | 1.90000 | 0.19000 | 0.0290 | 0.5300 | 0.1100 |
| | rebmix | 0.0004 | 0.00077 | 0.00037 | 0.0025 | 0.0068 | 0.0028 |


Figure 4: Benchmark summary plots of scenario U1 in Table 5 (balanced and well-separated components), organised as follows:

  • Panel A displays the global mixture distribution \(f_{\theta}(X)\) (pink solid line) and each of its constitutive components, scaled by their respective proportions (dotted lines).

  • Panel B displays the running times with the k-means initialisation. The number of observations (x-axis) and the running time (y-axis) are both on a \(\log_{10}\) scale, implying that any linear relationship between the running time and the number of observations is represented by a slope of 1. The points represent the median running time, and the coloured bands the 5th and 95th percentiles of the running time.

  • Panel C shows the boxplots associated with the distribution of the estimates, with one box per pair of package and initialisation method. The median is displayed as a bold black line, the mean as a yellow cross, and the 0.25 and 0.95 quantiles match the edges of the rectangular band. Solid black lines extending past the box boundaries represent \(1.5\) IQR; estimates beyond these limits are considered outliers and omitted from the plot. Finally, the true value of the parameter is represented as a dashed red line. The bold black label in the upper right-hand corner refers to the parameter whose distribution is shown in the corresponding facet. The first, second and third rows are the distributions of the ratios, means and variances of each component, identified by the column index.

Table 7: MSE and Bias associated to scenario U7, in Table 5 (unbalanced and well-separated components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.02900       2.8000        0.45000       0.1000         0.840          0.250
                         kmeans                 0.00730       0.7900        0.13000       0.0260         0.240          0.075
                         quantiles              0.16000       19.0000       3.20000       0.6400         6.100          1.800
                         random                 0.17000       10.5000       1.40000       0.3600         3.100          0.780
                         rebmix                 0.00027       0.0015        0.00077       0.0025         0.014          0.014
flexmix                  hc                     0.05500       2.8000        0.45000       0.1100         0.850          0.250
                         kmeans                 0.00760       0.7800        0.13000       0.0260         0.240          0.075
                         quantiles              0.11000       19.0000       3.20000       0.5400         6.000          1.900
                         random                 0.15000       8.4000        1.00000       0.3000         2.500          0.580
                         rebmix                 0.00027       0.0015        0.00076       0.0025         0.014          0.011
mclust / bgmm            hc                     0.03200       2.8000        0.45000       0.1000         0.850          0.250
                         kmeans                 0.00740       0.7800        0.13000       0.0260         0.240          0.075
                         quantiles              0.14000       19.0000       3.20000       0.6000         6.000          1.900
                         random                 0.18000       10.4000       1.40000       0.3600         3.100          0.800
                         rebmix                 0.00027       0.0015        0.00077       0.0025         0.014          0.014
mixtools                 hc                     0.03200       2.8000        0.45000       0.1000         0.850          0.250
                         kmeans                 0.00620       0.7600        0.13000       0.0170         0.230          0.079
                         quantiles              0.15000       19.0000       3.20000       0.5800         6.000          1.800
                         random                 0.18000       10.3000       1.40000       0.3600         3.100          0.800
                         rebmix                 0.00027       0.0015        0.00077       0.0025         0.014          0.014
Rmixmod / RGMMBench      hc                     0.02900       2.8000        0.45000       0.1000         0.850          0.250
                         kmeans                 0.00540       0.7700        0.13000       0.0190         0.230          0.078
                         quantiles              0.14000       19.0000       3.20000       0.5900         6.000          1.800
                         random                 0.17000       10.4000       1.40000       0.3600         3.100          0.800
                         rebmix                 0.00027       0.0015        0.00077       0.0025         0.014          0.014

Figure 5: Benchmark summary plots of scenario U7 in Table 5 (unbalanced and well-separated components), with the same layout as in Figure 4.

Table 8: MSE and Bias associated to scenario U3, in Table 5 (balanced and overlapping components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.0170        1.60          0.45          0.1950         1.320          0.93
                         kmeans                 0.0054        0.81          0.18          0.0125         0.023          0.32
                         quantiles              0.0070        0.67          0.30          0.0930         0.590          0.56
                         random                 0.0440        8.40          1.00          0.0710         0.330          0.63
                         rebmix                 0.0990        11.00         1.60          0.2000         1.600          0.78
flexmix                  hc                     0.0260        2.60          0.94          0.1120         1.160          1.22
                         kmeans                 0.0044        0.67          0.14          0.0036         0.091          0.27
                         quantiles              0.0054        0.57          0.27          0.0850         0.670          0.55
                         random                 0.0420        8.20          1.10          0.0450         0.450          0.68
                         rebmix                 0.1210        14.30         2.70          0.2700         2.700          1.17
mclust / bgmm            hc                     0.0110        2.50          0.84          0.0330         1.160          1.10
                         kmeans                 0.0068        0.86          0.24          0.0294         0.114          0.36
                         quantiles              0.0075        0.70          0.32          0.1110         0.720          0.63
                         random                 0.0490        9.10          1.20          0.0800         0.320          0.68
                         rebmix                 0.1410        10.90         2.90          0.2900         1.800          1.47
mixtools                 hc                     0.0320        2.40          0.80          0.0670         0.360          0.25
                         kmeans                 0.0415        2.51          1.11          0.1000         0.664          0.74
                         quantiles              0.0383        2.40          1.00          0.1170         0.770          0.78
                         random                 0.0660        9.40          1.80          0.0130         0.340          0.48
                         rebmix                 0.1090        9.60          2.50          0.2600         1.800          1.33
Rmixmod / RGMMBench      hc                     0.0220        2.00          0.67          0.0490         0.370          0.25
                         kmeans                 0.0318        2.31          0.85          0.0952         0.602          0.67
                         quantiles              0.0297        2.19          0.80          0.1210         0.770          0.76
                         random                 0.0620        9.40          1.70          0.0160         0.310          0.50
                         rebmix                 0.1140        10.30         2.60          0.2600         1.900          1.31

Figure 6: Benchmark summary plots of scenario U3 in Table 5 (balanced and overlapping components), with the same layout as in Figure 4.

Table 9: MSE and Bias associated to scenario U9, in Table 5 (unbalanced and overlapping components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.230         9.3           0.94          0.78           4.9            1.33
                         kmeans                 0.094         5.1           0.57          0.50           3.4            0.89
                         quantiles              0.230         9.9           1.00          0.80           5.1            1.34
                         random                 0.270         11.5          0.90          0.63           2.8            0.85
                         rebmix                 0.330         20.0          2.20          0.53           5.3            0.84
flexmix                  hc                     0.170         10.5          0.88          0.64           5.2            1.14
                         kmeans                 0.051         5.6           0.61          0.34           3.6            0.94
                         quantiles              0.210         11.3          1.20          0.75           5.6            1.53
                         random                 0.180         9.5           0.77          0.43           2.7            0.86
                         rebmix                 0.110         10.0          1.70          0.15           2.3            1.48
mclust / bgmm            hc                     0.230         10.2          0.84          0.79           5.1            1.20
                         kmeans                 0.107         5.5           0.62          0.53           3.6            0.96
                         quantiles              0.270         11.4          1.20          0.87           5.6            1.59
                         random                 0.300         12.2          1.06          0.66           2.9            0.84
                         rebmix                 0.270         21.0          2.50          0.46           5.2            1.13
mixtools                 hc                     0.200         9.7           1.19          0.64           3.4            0.69
                         kmeans                 0.135         7.7           1.16          0.46           2.1            0.48
                         quantiles              0.280         11.2          1.60          0.74           4.2            0.72
                         random                 0.350         15.7          1.62          0.65           2.1            0.64
                         rebmix                 0.240         22.0          2.70          0.47           5.1            1.18
Rmixmod / RGMMBench      hc                     0.210         9.5           1.07          0.69           3.8            0.79
                         kmeans                 0.113         6.5           0.90          0.46           2.4            0.43
                         quantiles              0.240         10.1          1.30          0.74           4.2            0.81
                         random                 0.320         14.6          1.45          0.61           2.1            0.58
                         rebmix                 0.250         22.0          2.70          0.49           5.2            1.18

Figure 7: Benchmark summary plots of scenario U9 in Table 5 (unbalanced and overlapping components), with the same layout as in Figure 4.

Figure 8: Benchmark summary plots of scenarios U4 and U6 in Table 5 (small unbalance, with additional overlap in scenario U6). Panels A and B display the univariate GMM distributions of scenarios U4 and U6 respectively, and Panels C and D the benchmarked distributions of scenarios U4 and U6, built as Panel C of Figure 4.

Figure 9: Correlation heatmaps of the estimated parameters extended to the four initialisation methods benchmarked, using the same configuration described in Figure 2, in the bivariate setting.

Figure 10: Distribution of the running times taken by each initialisation algorithm enumerated in Table 1, across all scenarios listed in Table 5, sorted by increasing ID number in lexicographical order.

Panels B of Figures 4 to 8 display the 0.05, 0.5 and 0.95 quantiles of the distribution of the running times taken for parameter estimation, for the scenarios listed in Table 5.

First, we note that the execution time grows asymptotically linearly with the number of observations, confirming empirically the expected linear complexity of the EM algorithm. The factor with the greatest impact on the observed differences is the complexity of the distribution, and especially the degree of overlap between the components.
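As an aside on how this linear scaling can be read off the log-log panels, the sketch below fits a least-squares slope of \(\log_{10}\)(time) against \(\log_{10}(n)\). The timing values are invented for the illustration (the benchmark's actual timings are shown in the figures); a slope close to 1 is the signature of linear complexity.

```python
import math

# Hypothetical median running times (seconds) for the benchmarked sample sizes;
# the values are made up for illustration and roughly follow t proportional to n.
n_obs = [100, 200, 500, 1000, 2000, 5000, 10000]
times = [0.004, 0.008, 0.021, 0.043, 0.082, 0.21, 0.41]

# Least-squares slope of log10(time) against log10(n): a slope close to 1
# indicates that the running time scales linearly with the number of observations.
xs = [math.log10(n) for n in n_obs]
ys = [math.log10(t) for t in times]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))
```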

To better understand the differences in running times observed between the packages for a given scenario, we perform a three-way ANOVA, taking into account the choice of initialisation method, the programming language and the class of packages.
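On a balanced design, the main-effect sums of squares of such an ANOVA can be obtained directly from the marginal means of each factor. The sketch below is a purely illustrative Python decomposition on simulated data; the factor levels, the effect sizes and the response values are assumptions of the example, not the study's actual measurements or analysis code.

```python
import random
from itertools import product

# Assumed factor levels, mirroring the three factors named in the text.
inits = ["hc", "kmeans", "quantiles", "random"]
languages = ["R", "C++"]
pkg_classes = ["EM", "k-means"]

# Simulated log running times with an assumed language effect dominating.
rng = random.Random(0)
lang_effect = {"R": 0.5, "C++": -0.5}
data = [(i, l, c, lang_effect[l] + rng.gauss(0, 0.1))
        for i, l, c in product(inits, languages, pkg_classes)
        for _ in range(5)]

grand = sum(row[3] for row in data) / len(data)
sst = sum((row[3] - grand) ** 2 for row in data)

def main_effect_share(pos, levels):
    """Share of total variance explained by one factor's marginal means
    (a valid main-effect sum of squares only on a balanced design)."""
    ss = 0.0
    for lev in levels:
        group = [row[3] for row in data if row[pos] == lev]
        ss += len(group) * (sum(group) / len(group) - grand) ** 2
    return ss / sst

for name, pos, levels in [("init", 0, inits), ("language", 1, languages),
                          ("package class", 2, pkg_classes)]:
    print(name, round(main_effect_share(pos, levels), 2))
```

With these simulated effects, the programming-language factor accounts for almost all of the variance, which is the kind of contrast the ANOVA is meant to surface.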

Supplementary Figures and Tables in the bivariate simulation

Table 10 below lists the complete set of parameters used to simulate the multivariate Gaussian mixture distributions in our benchmark:

Table 10: The 20 parameter configurations tested to generate the samples of the bivariate experiment.
ID Entropy OVL Proportions Means Correlations
B1 1.00 0.15000 0.5 / 0.5 (0,2);(2,0) -0.8 / -0.8
B2 1.00 0.07300 0.5 / 0.5 (0,2);(2,0) -0.8 / 0.8
B3 1.00 0.07300 0.5 / 0.5 (0,2);(2,0) 0.8 / -0.8
B4 1.00 0.00078 0.5 / 0.5 (0,2);(2,0) 0.8 / 0.8
B5 1.00 0.07900 0.5 / 0.5 (0,2);(2,0) 0 / 0
B6 1.00 0.00000 0.5 / 0.5 (0,20);(20,0) -0.8 / -0.8
B7 1.00 0.00000 0.5 / 0.5 (0,20);(20,0) -0.8 / 0.8
B8 1.00 0.00000 0.5 / 0.5 (0,20);(20,0) 0.8 / -0.8
B9 1.00 0.00000 0.5 / 0.5 (0,20);(20,0) 0.8 / 0.8
B10 1.00 0.00000 0.5 / 0.5 (0,20);(20,0) 0 / 0
B11 0.47 0.06600 0.9 / 0.1 (0,2);(2,0) -0.8 / -0.8
B12 0.47 0.01600 0.9 / 0.1 (0,2);(2,0) -0.8 / 0.8
B13 0.47 0.05000 0.9 / 0.1 (0,2);(2,0) 0.8 / -0.8
B14 0.47 0.00045 0.9 / 0.1 (0,2);(2,0) 0.8 / 0.8
B15 0.47 0.03900 0.9 / 0.1 (0,2);(2,0) 0 / 0
B16 0.47 0.00000 0.9 / 0.1 (0,20);(20,0) -0.8 / -0.8
B17 0.47 0.00000 0.9 / 0.1 (0,20);(20,0) -0.8 / 0.8
B18 0.47 0.00000 0.9 / 0.1 (0,20);(20,0) 0.8 / -0.8
B19 0.47 0.00000 0.9 / 0.1 (0,20);(20,0) 0.8 / 0.8
B20 0.47 0.00000 0.9 / 0.1 (0,20);(20,0) 0 / 0

Figures 11-14 are associated with scenarios B11-B15 of Table 10. Summary Tables 11-14 show the average performance of each combination of benchmarked package and initialisation method, with the same conventions as discussed in Supplementary Figures and Tables in the univariate simulation.

First, we can directly observe that the OVL increases as the individual variance of each component, the proximity of the centroids of the clusters and the level of imbalance are increased. We demonstrate this statement formally in the section An analytic formula of the overlap for univariate Gaussian mixtures. Nonetheless, the influence of the correlation between the \(x\)- and \(y\)-axes (the off-diagonal term of the covariance matrix) is less immediate; notably, the assumption of independent features does not automatically entail a lower OVL or a simpler estimation.

From our experiments, we deduce that the highest OVL is obtained when the main axes of the two components align with the line joining the two centroids. For instance, in our scenarios, the lowest OVL is obtained when the correlation term is positive for both clusters (scenario B14, Table 10, and isodensity plot in Panel A, Figure 13), whereas the highest OVL is obtained with a negative correlation (scenario B11, Table 10, and isodensity plot in Panel A, Figure 11). Recall that the slope of the line joining the centroids of the two components is indeed negative in all our simulated distributions.
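This geometric effect can be checked numerically. The sketch below is an illustrative Monte Carlo estimate in Python (not part of the benchmark code): it reads the overlap as the Bayes misclassification rate, which is an assumption of the sketch, between two equally weighted unit-variance components centred at (0, 2) and (2, 0), for a common correlation of +0.8 versus -0.8.

```python
import math
import random

def bvn_pdf(x, y, mx, my, sx, sy, rho):
    """Density of a bivariate normal with means (mx, my), s.d.s (sx, sy), correlation rho."""
    zx, zy = (x - mx) / sx, (y - my) / sy
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    return math.exp(-q / 2) / (2 * math.pi * sx * sy * math.sqrt(1 - rho * rho))

def sample_bvn(mx, my, sx, sy, rho, rng):
    """Draw one point from the bivariate normal via a Cholesky-style construction."""
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    return mx + sx * u, my + sy * (rho * u + math.sqrt(1 - rho * rho) * v)

def mc_overlap(rho, n=20000, seed=1):
    """Monte Carlo estimate of the Bayes misclassification rate between the
    two components (overlap read as the Bayes error is an assumption here)."""
    rng = random.Random(seed)
    centroids = [(0.0, 2.0), (2.0, 0.0)]
    errors = 0
    for _ in range(n):
        j = rng.randrange(2)
        mx, my = centroids[j]
        x, y = sample_bvn(mx, my, 1.0, 1.0, rho, rng)
        dens = [bvn_pdf(x, y, cx, cy, 1.0, 1.0, rho) for cx, cy in centroids]
        errors += dens[j] < dens[1 - j]
    return errors / n

# The line joining the centroids has slope -1: a positive correlation in both
# components (major axes perpendicular to that line) gives far less overlap
# than a negative one (major axes aligned with it).
print(mc_overlap(0.8) < mc_overlap(-0.8))  # prints True
```

Under this reading, the estimates come out close to the OVL values of Table 10 for the corresponding balanced scenarios (about 0.15 for a correlation of -0.8 in both components, and well below 0.01 for +0.8).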

Table 11: MSE and Bias associated to scenario B11, in Table 10 (unbalanced, overlapping and negatively correlated components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.230         3.90          1.8           0.550          2.30           1.200
                         kmeans                 0.136         2.80          1.9           0.450          2.30           2.200
                         random                 0.028         1.27          1.1           0.084          0.12           0.140
                         rebmix                 0.071         2.20          1.4           0.170          0.66           0.111
flexmix                  hc                     0.260         3.90          1.9           0.480          2.40           1.300
                         kmeans                 0.077         2.80          1.9           0.270          2.40           2.300
                         random                 0.028         0.96          1.0           0.064          0.77           0.720
                         rebmix                 0.087         1.90          1.0           0.170          1.02           0.468
mclust / bgmm            hc                     0.230         3.90          1.8           0.550          2.30           1.200
                         kmeans                 0.136         2.80          1.9           0.450          2.30           2.200
                         random                 0.028         1.27          1.1           0.084          0.12           0.140
                         rebmix                 0.071         2.20          1.4           0.170          0.66           0.111
mixtools                 hc                     0.210         3.30          1.8           0.470          1.80           1.100
                         kmeans                 0.131         2.60          1.8           0.380          1.80           1.800
                         random                 0.051         1.61          1.1           0.129          0.20           0.180
                         rebmix                 0.093         2.40          1.4           0.210          0.60           0.063
Rmixmod / RGMMBench      hc                     0.210         3.30          1.8           0.470          1.80           1.100
                         kmeans                 0.131         2.60          1.8           0.380          1.80           1.800
                         random                 0.051         1.61          1.1           0.129          0.20           0.180
                         rebmix                 0.093         2.40          1.4           0.210          0.60           0.063

Figure 11: Results of scenario B11 in Table 10 (unbalanced, overlapping and negatively correlated components), organised as follows:

  • Panel A displays the bivariate contour maps associated with the two-component multivariate Gaussian distribution corresponding to the parametrisation described by the scenario, warmer colours corresponding to regions of higher density. The two centroids, whose coordinates are given by the elements of the mean components, are represented with distinctly shaped and coloured point estimates.

  • In both Panels A and B, the ellipsoids correspond to the \(95\%\) confidence region associated with each component's distribution. To generate them, we drew largely on mixtools::ellipse() and the website How to draw ellipses, retaining for each individual parameter its mean (similar results with the median) over the \(N=100\) sampling experiments, restricted to the random initialisation method.

  • The running times are displayed in Panel C with the k-means initialisation. The number of observations (x-axis) and the running time (y-axis) are on a \(\log_{10}\) scale. The points represent the median running time, and the coloured bands the \(5^{\text{th}}\) and \(95^{\text{th}}\) percentiles of the running time.

  • The distributions of the Hellinger distances (a closed form is only available for the multivariate Gaussian distribution, not the mixture) are computed for each component, each initialisation method and each package, with respect to the true Gaussian distribution expected for each component. The more dissimilar the distributions, the higher the Hellinger distance, which is normalised between 0 and 1. We represent them using boxplots in Panel D.

  • In Panel E we represent the boxplots of the distribution of the estimates, with one box per pair of package and initialisation method, using the same conventions detailed in Supplementary Figures and Tables in the univariate simulation. As the correlation matrix is symmetric, we only represent the distribution of its lower triangular part. Each column is associated with the parameters of one component. The first row represents the distribution of the estimated ratios, the second and third the distributions of the mean vector on the x-axis and on the y-axis, the fourth and fifth the distributions of the individual variances of each feature, and the final row the distribution of the correlation between dimensions 1 and 2.
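The \(95\%\) confidence ellipses of Panels A and B can also be built from first principles. The sketch below is illustrative Python standing in for the mixtools::ellipse()-based R code used in the paper: it draws the boundary of the level-\(0.95\) region of a bivariate Gaussian from the closed-form eigen-decomposition of its \(2 \times 2\) covariance matrix. The example parameters mimic one component of scenario B11 and are assumptions of the sketch.

```python
import math

def confidence_ellipse(mean, cov, level=0.95, n_points=100):
    """Boundary points of the `level` confidence region of a bivariate Gaussian.

    The region {x : (x - mu)' Sigma^{-1} (x - mu) <= q} is an ellipse, where q
    is the chi-square quantile with 2 degrees of freedom, i.e. -2*log(1 - level).
    """
    (a, b), (_, d) = cov  # symmetric 2x2 covariance [[a, b], [b, d]]
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    tr, det = a + d, a * d - b * b
    gap = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + gap, tr / 2 - gap          # eigenvalues, l1 >= l2
    theta = 0.5 * math.atan2(2 * b, a - d)       # orientation of the major axis
    r = math.sqrt(-2 * math.log(1 - level))      # chi-square radius, 2 d.o.f.
    pts = []
    for k in range(n_points):
        t = 2 * math.pi * k / n_points
        u = r * math.sqrt(l1) * math.cos(t)
        v = r * math.sqrt(l2) * math.sin(t)
        pts.append((mean[0] + u * math.cos(theta) - v * math.sin(theta),
                    mean[1] + u * math.sin(theta) + v * math.cos(theta)))
    return pts

# Example: ellipse for a component with mean (0, 2), unit variances and
# correlation -0.8; its major axis has slope -1.
pts = confidence_ellipse((0.0, 2.0), [[1.0, -0.8], [-0.8, 1.0]])
```

Every returned point sits at the same Mahalanobis distance from the mean, which is what makes the curve a constant-density contour of the component.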

Table 12: MSE and Bias associated to scenario B12, in Table 10 (unbalanced, overlapping and oppositely correlated components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.00076       0.049         0.16          0.0063         0.056          0.131
                         kmeans                 0.00076       0.049         0.16          0.0063         0.056          0.131
                         random                 0.00075       0.049         0.16          0.0057         0.055          0.123
                         rebmix                 0.00087       0.066         0.17          0.0070         0.063          0.190
flexmix                  hc                     0.00144       0.049         0.16          0.0101         0.055          0.071
                         kmeans                 0.00144       0.049         0.16          0.0101         0.055          0.071
                         random                 0.00144       0.050         0.16          0.0099         0.054          0.067
                         rebmix                 0.00145       0.048         0.16          0.0142         0.047          0.110
mclust / bgmm            hc                     0.00076       0.049         0.16          0.0063         0.056          0.131
                         kmeans                 0.00076       0.049         0.16          0.0063         0.056          0.131
                         random                 0.00075       0.049         0.16          0.0057         0.055          0.124
                         rebmix                 0.00087       0.066         0.17          0.0070         0.063          0.190
mixtools                 hc                     0.00075       0.050         0.16          0.0049         0.054          0.112
                         kmeans                 0.00075       0.050         0.16          0.0049         0.054          0.112
                         random                 0.00075       0.050         0.16          0.0049         0.054          0.112
                         rebmix                 0.00086       0.066         0.17          0.0061         0.062          0.170
Rmixmod / RGMMBench      hc                     0.00075       0.050         0.16          0.0049         0.054          0.112
                         kmeans                 0.00075       0.050         0.16          0.0049         0.054          0.112
                         random                 0.00075       0.050         0.16          0.0049         0.054          0.112
                         rebmix                 0.00086       0.066         0.17          0.0061         0.062          0.170

Figure 12: Results of scenario B12 in Table 10 (unbalanced, overlapping and oppositely correlated components), with the same layout as Figure 11.

Table 13: MSE and Bias associated to scenario B14, in Table 10 (unbalanced, overlapping and positively correlated components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.00043       0.044         0.13          0.00081        0.044          0.060
                         kmeans                 0.00043       0.044         0.13          0.00081        0.044          0.060
                         random                 0.00043       0.044         0.13          0.00080        0.044          0.060
                         rebmix                 0.00040       0.044         0.13          0.00120        0.047          0.053
flexmix                  hc                     0.00043       0.044         0.13          0.00072        0.043          0.035
                         kmeans                 0.00043       0.044         0.13          0.00072        0.043          0.035
                         random                 0.00043       0.044         0.13          0.00072        0.044          0.035
                         rebmix                 0.00040       0.044         0.14          0.00110        0.047          0.044
mclust / bgmm            hc                     0.00043       0.044         0.13          0.00081        0.044          0.060
                         kmeans                 0.00043       0.044         0.13          0.00081        0.044          0.060
                         random                 0.00043       0.044         0.13          0.00080        0.044          0.060
                         rebmix                 0.00040       0.044         0.13          0.00120        0.047          0.053
mixtools                 hc                     0.00043       0.044         0.13          0.00078        0.044          0.060
                         kmeans                 0.00043       0.044         0.13          0.00078        0.044          0.060
                         random                 0.00043       0.044         0.13          0.00078        0.044          0.060
                         rebmix                 0.00040       0.044         0.13          0.00110        0.047          0.053
Rmixmod / RGMMBench      hc                     0.00043       0.044         0.13          0.00078        0.044          0.060
                         kmeans                 0.00043       0.044         0.13          0.00078        0.044          0.060
                         random                 0.00043       0.044         0.13          0.00078        0.044          0.060
                         rebmix                 0.00040       0.044         0.13          0.00110        0.047          0.053

Figure 13: Results of scenario B14 in Table 10 (unbalanced, overlapping and positively correlated components), with the same layout as Figure 11.

Table 14: MSE and Bias associated to scenario B15, in Table 10 (unbalanced, overlapping and uncorrelated components)

Package                  Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ
EMCluster / GMKMcharlie  hc                     0.1110        2.30          1.30          0.280          1.40           0.90
                         kmeans                 0.0500        1.50          1.30          0.200          1.05           1.06
                         random                 0.0290        0.71          0.63          0.070          0.28           0.19
                         rebmix                 0.0163        0.69          0.78          0.074          0.37           0.44
flexmix                  hc                     0.1330        2.40          1.40          0.240          1.50           1.05
                         kmeans                 0.0320        1.60          1.40          0.110          1.21           1.26
                         random                 0.0370        0.71          0.64          0.048          0.35           0.29
                         rebmix                 0.0058        0.70          0.84          0.028          0.49           0.62
mclust / bgmm            hc                     0.1110        2.30          1.30          0.280          1.40           0.90
                         kmeans                 0.0500        1.50          1.30          0.200          1.05           1.06
                         random                 0.0290        0.71          0.63          0.070          0.28           0.19
                         rebmix                 0.0163        0.69          0.78          0.074          0.37           0.44
mixtools                 hc                     0.0860        1.90          1.20          0.220          1.10           0.75
                         kmeans                 0.0470        1.30          1.10          0.170          0.79           0.78
                         random                 0.0230        0.67          0.66          0.065          0.24           0.19
                         rebmix                 0.0158        0.69          0.77          0.068          0.30           0.37
Rmixmod / RGMMBench      hc                     0.0860        1.90          1.20          0.220          1.10           0.75
                         kmeans                 0.0470        1.30          1.10          0.170          0.79           0.78
                         random                 0.0230        0.67          0.66          0.065          0.24           0.19
                         rebmix                 0.0158        0.69          0.77          0.068          0.30           0.37

Figure 14: Results of scenario B15 in Table 10 (unbalanced, overlapping and uncorrelated components), with the same layout as Figure 11.

In contrast to the univariate setting (Supplementary Figures and Tables in the univariate simulation), the fastest packages are bgmm, EMCluster, flexmix and Rmixmod, and the slowest mclust, GMKMcharlie and mixtools, independently of the difficulty of the simulation.

Finally, Figures 15, 16 and 17 give a condensed view of the less interesting scenarios benchmarked, with the contour maps on the left and the corresponding Hellinger boxplots on the right, one scenario per row.

Figure 15: Benchmark summary plots of scenarios B1, B2 and B5 in Table 10 featuring balanced and overlapping clusters. Summary plots of B1, B2 and B5 are represented in this order on each row, with the left column displaying the 95% confidence ellipsoidal regions associated to the mean estimated parameters across each package and the right column the distribution of the Hellinger distances.

Figure 16: Benchmark summary plots of scenarios B6, B7 and B10 in Table 10, featuring balanced and well-separated clusters, with the same layout as Figure 15.

Figure 17: Benchmark summary plots of scenarios B16, B17 and B20 in Table 10, featuring unbalanced and well-separated clusters, with the same layout as Figure 15.

Figure 18: Correlation heatmaps of the estimated parameters in the bivariate setting, extended to the four initialisation methods benchmarked, for the most discriminating scenario B11, using the same process described in Figure 2.

Supplementary Figures and Tables in the HD simulation

Table 15 below lists the complete set of parameters used to simulate Gaussian distributions in the high-dimensional benchmark:

Table 15: The 16 parameter configurations tested to generate the samples in a high-dimensional context. The first digit of each ID refers to a unique parameter configuration (identified by its level of overlap, entropy and the topological structure, either circular or ellipsoidal, of its covariance matrix), while the lowercase letter denotes the number of observations: a) \(n=200\) and b) \(n=2000\).
ID OVL n Proportions
HD1a 1e-04 200 0.5 / 0.5
HD1b 1e-04 2000 0.5 / 0.5
HD2a 1e-04 200 0.19 / 0.81
HD2b 1e-04 2000 0.19 / 0.81
HD3a 1e-04 200 0.5 / 0.5
HD3b 1e-04 2000 0.5 / 0.5
HD4a 1e-04 200 0.21 / 0.79
HD4b 1e-04 2000 0.21 / 0.79
HD5a 2e-01 200 0.5 / 0.5
HD5b 2e-01 2000 0.5 / 0.5
HD6a 2e-01 200 0.15 / 0.85
HD6b 2e-01 2000 0.15 / 0.85
HD7a 2e-01 200 0.5 / 0.5
HD7b 2e-01 2000 0.5 / 0.5
HD8a 2e-01 200 0.69 / 0.31
HD8b 2e-01 2000 0.69 / 0.31
Table 16: MSE and Bias associated to scenario HD4a, in Table 15 (unbalanced, separated and ellipsoidal components).

Package                          Initialisation method  Global MSE p  Global MSE μ  Global MSE σ  Global Bias p  Global Bias μ  Global Bias σ  Success
mixtools / Rmixmod / RGMMBench   hc                     0.0333        0.0212        0.0106        0.0020         0.056          0.097          100
                                 kmeans                 0.0333        0.0212        0.0106        0.0020         0.056          0.097          100
                                 rebmix                 0.3244        0.1980        0.0845        0.0720         0.395          0.535          98
mclust / flexmix / GMKMcharlie   hc                     0.0333        0.0212        0.0106        0.0020         0.056          0.097          100
                                 kmeans                 0.0333        0.0212        0.0106        0.0020         0.056          0.097          100
                                 rebmix                 0.2553        0.1444        0.0924        0.0470         0.364          0.596          85
bgmm                             hc                     0.0337        0.0214        0.0107        0.0070         0.064          0.096          100
                                 kmeans                 0.0338        0.0216        0.0106        0.0074         0.064          0.096          100
                                 rebmix                 0.4818        0.1152        0.3442        0.0320         0.223          2.329          94
EMCluster                        hc                     0.0333        0.0212        0.0107        0.0023         0.056          0.096          100
                                 kmeans                 0.0334        0.0213        0.0106        0.0018         0.056          0.096          100
                                 rebmix                 1.5983        1.0992        0.3794        0.3100         2.018          2.575          84
HDclassif                        hc                     8.4062        8.3936        0.0111        0.0020         10.426         0.149          100
                                 kmeans                 7.9407        7.9282        0.0111        0.0019         10.081         0.149          100
                                 rebmix                 7.9803        7.9514        0.0273        0.0044         10.128         0.262          84
EMMIXmfa                         hc                     4.0605        3.3317        0.3357        0.6500         5.757          2.772          95
                                 kmeans                 3.8790        3.2175        0.3372        0.5400         5.781          2.777          96
                                 rebmix                 4.0127        3.2715        0.3337        0.5700         5.680          2.757          80


Figure 19: Results of scenario HD4a in Table 15 (unbalanced, separated and ellipsoidal components), organised as follows:

  • Panel A displays the bivariate factorial projection of a random sample drawn from the 10-dimensional multivariate Gaussian distribution parametrised by Table 15. Each component is associated with a specific colour, a centroid whose coordinates are given by the elements of the mean components in the bivariate projected space, and a \(95\%\) confidence ellipse. Arrows represent the correlation circle of the dimensional variables. Both panels were displayed using the functions factoextra::fviz_eig() and factoextra::fviz_pca_biplot() respectively, while the underlying computations proceed from the principal component analysis performed by ade4::dudi.pca(), preceded by standard scaling of the sampled dataset.

  • Panel B pictures the parallel distribution plots of a random sample of \(n=100\) observations, generated using the GGally::ggparcoord() function and representing the coordinates of each simulated data point in 10 dimensions.

  • The running times are displayed in Panel C with the k-means initialisation. The number of observations (x-axis) and the running time (y-axis) are on a \(\log_{10}\) scale.

  • The distributions of the Hellinger distances are computed in Panel D for each component, each initialisation method and each package, with respect to the true Gaussian distribution expected for each component.

  • In Panel E we represent the boxplots of the distribution of some of the estimates. Since it was impractical to represent all of the \(k + kD + k\frac{D(D+1)}{2}\) parameters (with \(k=2\) and \(D=10\)), we only represent the first component's mean, the variances of the first two dimensions and their covariance term.

Table 17: MSE and Bias associated with scenario HD7a) in Table 15 (balanced, overlapping and ellipsoidal components).

| Package | Initialisation Method | Global MSE p | Global MSE μ | Global MSE σ | Global Bias p | Global Bias μ | Global Bias σ | Success |
|---|---|---|---|---|---|---|---|---|
| mixtools / Rmixmod / RGMMBench | hc | 5.8544 | 2.2153 | 3.5586 | 0.0450 | 2.084 | 6.704 | 100 |
| | kmeans | 5.4773 | 1.9490 | 3.4819 | 0.0086 | 2.323 | 6.861 | 100 |
| | rebmix | 6.9243 | 2.5185 | 4.2898 | 0.0620 | 2.670 | 7.626 | 97 |
| mclust / flexmix / GMKMcharlie | hc | 6.0584 | 2.4737 | 3.5198 | 0.0180 | 2.565 | 7.624 | 100 |
| | kmeans | 5.6388 | 2.1597 | 3.4549 | 0.0140 | 2.744 | 8.266 | 100 |
| | rebmix | 6.5397 | 2.4738 | 3.9661 | 0.0700 | 2.774 | 7.764 | 93 |
| bgmm | hc | 9.5015 | 5.1348 | 4.1086 | 0.1000 | 3.720 | 10.310 | 100 |
| | kmeans | 8.7930 | 4.7119 | 3.8693 | 0.1500 | 3.932 | 10.108 | 100 |
| | rebmix | 10.3630 | 5.6474 | 4.4026 | 0.2700 | 3.798 | 10.049 | 97 |
| EMCluster | hc | 6.4022 | 2.8255 | 3.5124 | 0.0120 | 3.141 | 9.086 | 100 |
| | kmeans | 6.4333 | 2.8740 | 3.5523 | 0.0110 | 4.210 | 11.007 | 100 |
| | rebmix | 6.5527 | 2.9643 | 3.4862 | 0.0580 | 3.051 | 9.253 | 93 |
| HDclassif | hc | 15.9010 | 11.5382 | 4.2950 | 0.1400 | 10.846 | 10.100 | 100 |
| | kmeans | 15.3377 | 10.9441 | 4.3716 | 0.0087 | 10.990 | 10.771 | 100 |
| | rebmix | 16.1231 | 11.1103 | 4.9113 | 0.1600 | 10.761 | 10.513 | 93 |
| EMMIXmfa | hc | 4.8606 | 1.6546 | 3.1856 | 0.0160 | 2.030 | 7.395 | 15 |
| | kmeans | 4.4039 | 1.4129 | 2.9701 | 0.0260 | 1.734 | 6.236 | 21 |
| | rebmix | 5.0984 | 2.0057 | 3.0689 | 0.0470 | 2.314 | 7.613 | 16 |


Figure 20: Results of scenario HD7a) in Table 15 (balanced and overlapping components, with full covariance structure), with the same layout as Figure 19.

Table 18: MSE and Bias associated with scenarios HD5a) and HD6a) in Table 15 (overlapping and spherically-distributed components). Within each entry, the two values separated by double slashes correspond respectively to the balanced (HD5a) and unbalanced (HD6a) clusters.

| Package | Initialisation Method | Global MSE p | Global MSE μ | Global MSE σ | Global Bias p | Global Bias μ | Global Bias σ | Success |
|---|---|---|---|---|---|---|---|---|
| mixtools / Rmixmod / RGMMBench | hc | 4.2772 // 19.5172 | 0.9198 // 1.9835 | 3.3027 // 17.4943 | 0.017 // 0.069 | 0.995 // 0.6 | 3.571 // 4.381 | 100 // 100 |
| | kmeans | 3.9776 // 17.2212 | 0.8279 // 1.6336 | 3.1111 // 15.5684 | 0.072 // 0.069 | 0.841 // 0.82 | 3.023 // 5.034 | 100 // 100 |
| | rebmix | 9.3136 // 25.8028 | 2.7793 // 4.2893 | 6.4009 // 21.3519 | 0.15 // 0.22 | 3.619 // 2.507 | 9.061 // 11.826 | 96 // 80 |
| mclust / flexmix / GMKMcharlie | hc | 2.9743 // 18.1175 | 0.5862 // 1.7729 | 2.3612 // 16.3168 | 0.024 // 0.057 | 0.449 // 0.514 | 2.127 // 4.412 | 100 // 100 |
| | kmeans | 2.5629 // 15.2959 | 0.4642 // 1.5608 | 2.0855 // 13.7206 | 0.085 // 0.086 | 0.671 // 1.047 | 1.67 // 5.801 | 100 // 100 |
| | rebmix | 8.2907 // 23.7588 | 2.6468 // 4.1629 | 5.5421 // 19.4579 | 0.12 // 0.22 | 3.438 // 2.543 | 8.792 // 11.94 | 96 // 69 |
| bgmm | hc | 2.4088 // 33.8392 | 0.7261 // 9.0609 | 1.6153 // 24.6796 | 0.12 // 0.038 | 0.652 // 1.986 | 1.98 // 10.77 | 100 // 100 |
| | kmeans | 2.0912 // 28.5103 | 0.5899 // 7.5426 | 1.4577 // 20.8989 | 0.091 // 0.025 | 0.566 // 1.45 | 1.738 // 9.783 | 100 // 100 |
| | rebmix | 4.6278 // 35.9294 | 1.9526 // 11.0184 | 2.5372 // 24.6276 | 0.048 // 0.22 | 0.632 // 2.023 | 2.96 // 12.729 | 98 // 86 |
| EMCluster | hc | 2.5152 // 17.7053 | 0.5087 // 2.1191 | 1.9849 // 15.5379 | 0.024 // 0.12 | 0.321 // 0.929 | 1.512 // 5.611 | 100 // 100 |
| | kmeans | 1.793 // 12.8799 | 0.3527 // 1.6839 | 1.4344 // 11.155 | 0.062 // 0.24 | 0.593 // 2.177 | 2.547 // 9.595 | 100 // 100 |
| | rebmix | 6.9275 // 23.0817 | 2.7461 // 5.4713 | 4.0985 // 17.4511 | 0.044 // 0.32 | 3.177 // 3.836 | 8.535 // 15.437 | 96 // 70 |
| HDclassif | hc | 11.4938 // 49.4328 | 9.1746 // 12.2155 | 2.2913 // 36.5886 | 0.027 // 0.91 | 8.899 // 9.56 | 1.98 // 19.55 | 100 // 100 |
| | kmeans | 11.1438 // 40.4749 | 9.0384 // 11.9946 | 2.0912 // 28.0385 | 0.096 // 0.7 | 9.059 // 9.024 | 1.682 // 16.35 | 100 // 100 |
| | rebmix | 14.6998 // 47.2364 | 8.7649 // 12.6715 | 5.8029 // 33.929 | 0.22 // 0.92 | 8.135 // 9.145 | 8.018 // 21.824 | 96 // 70 |
| EMMIXmfa | hc | 5.6809 // 21.1181 | 3.7272 // 6.1206 | 1.7452 // 14.9126 | 0.41 // 0.019 | 5.772 // 3.645 | 4.299 // 12.812 | 96 // 45 |
| | kmeans | 5.7063 // 21.3775 | 3.6759 // 6.589 | 1.79 // 14.5681 | 0.39 // 0.17 | 5.788 // 4.08 | 4.357 // 13.352 | 96 // 40 |
| | rebmix | 5.8175 // 19.9703 | 3.8142 // 6.3202 | 1.7592 // 13.5389 | 0.35 // 0.033 | 5.819 // 4.402 | 4.349 // 13.812 | 93 // 34 |

Table 19: Minimal example separating the MSE and Bias according to whether they proceed from the diagonal or the off-diagonal terms of the covariance matrix, for scenarios HD5a) and HD6a) in Table 15 (overlapping and spherically-distributed components). Within each entry, the two values separated by double slashes correspond respectively to scenarios HD5a) and HD6a).

| Package | Initialisation Method | Global MSE Diag(Σ) | Global MSE Upper(Σ) | Global bias Diag(Σ) | Global bias Upper(Σ) |
|---|---|---|---|---|---|
| mixtools / Rmixmod / RGMMBench | hc | 1.1 // 5.9 | 2.2 // 12 | 0.9194 // 2.3003 | 2.651 // 2.081 |
| | kmeans | 0.99 // 5.6 | 2.1 // 10 | 0.8929 // 2.7422 | 2.13 // 2.292 |
| mclust / flexmix / GMKMcharlie | hc | 0.76 // 5.5 | 1.6 // 11 | 0.5698 // 2.418 | 1.557 // 1.994 |
| | kmeans | 0.67 // 5.2 | 1.4 // 8.5 | 0.6909 // 3.4316 | 0.979 // 2.37 |
| bgmm | hc | 0.67 // 11 | 0.94 // 14 | 0.7755 // 6.9204 | 1.205 // 3.849 |
| | kmeans | 0.58 // 9.1 | 0.88 // 12 | 0.6004 // 6.124 | 1.138 // 3.659 |
| EMCluster | hc | 0.62 // 6.1 | 1.4 // 9.5 | 0.4985 // 3.4685 | 1.013 // 2.143 |
| | kmeans | 0.48 // 7.5 | 0.95 // 3.6 | 0.9269 // 7.6352 | 1.621 // 1.96 |
| HDclassif | hc | 0.72 // 26 | 1.6 // 11 | 0.5383 // 17.2225 | 1.441 // 2.328 |
| | kmeans | 0.68 // 20 | 1.4 // 8.5 | 0.7156 // 13.7249 | 0.966 // 2.626 |
| EMMIXmfa | hc | 1.6 // 10 | 0.13 // 4.6 | 3.7632 // 10.0798 | 0.536 // 2.733 |
| | kmeans | 1.6 // 11 | 0.17 // 3.9 | 3.7621 // 10.8057 | 0.594 // 2.546 |


Figure 21: Two multivariate benchmark scenarios gathered on the same plot, both with a strictly spherical structure of the covariance matrix:

  • Panels A and B represent respectively the bivariate projection and the parallel distribution plot associated with scenario HD5a) in Table 15 (balanced and overlapping components, with spherical covariance structure).

  • Panel C displays the boxplots associated with scenario HD5a), computed as in Panel E of Figure 19.

  • Panel D displays the boxplots associated with scenario HD6a) (unbalanced and overlapping components, with spherical covariance structure).

Table 20: MSE and Bias associated with scenarios HD1a) and HD1b) in Table 15 (well-separated and spherically-distributed components). Within each entry, the two values separated by double slashes report respectively the scores with n=200 and n=2000 observations.

| Package | Initialisation Method | Global MSE p | Global MSE μ | Global MSE σ | Global Bias p | Global Bias μ | Global Bias σ | Success |
|---|---|---|---|---|---|---|---|---|
| mixtools / Rmixmod / RGMMBench | hc | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0097 // 0.00071 | 0.053 // 0.018 | 0.139 // 0.04 | 100 // 100 |
| | kmeans | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0097 // 0.00071 | 0.053 // 0.018 | 0.139 // 0.04 | 100 // 100 |
| | rebmix | 0.5611 // 0.0058 | 0.2364 // 0.0028 | 0.3035 // 0.0026 | 0.019 // 0.00071 | 0.372 // 0.018 | 0.915 // 0.04 | 98 // 100 |
| mclust / flexmix / GMKMcharlie | hc | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0095 // 0.00071 | 0.053 // 0.018 | 0.14 // 0.04 | 100 // 100 |
| | kmeans | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0095 // 0.00071 | 0.053 // 0.018 | 0.14 // 0.04 | 100 // 100 |
| | rebmix | 0.3134 // 0.0058 | 0.1305 // 0.0029 | 0.1729 // 0.0026 | 0.0039 // 0.0022 | 0.2 // 0.02 | 0.537 // 0.044 | 88 // 81 |
| bgmm | hc | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0098 // 0.00071 | 0.053 // 0.018 | 0.139 // 0.041 | 100 // 100 |
| | kmeans | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0097 // 0.00071 | 0.053 // 0.018 | 0.139 // 0.04 | 100 // 100 |
| | rebmix | 0.7437 // 0.1977 | 0.3409 // 0.0028 | 0.3895 // 0.1946 | 0.02 // 0.00034 | 0.308 // 0.017 | 1.602 // 0.926 | 97 // 99 |
| EMCluster | hc | 0.0577 // 0.0058 | 0.0289 // 0.0028 | 0.0264 // 0.0026 | 0.0093 // 0.00061 | 0.054 // 0.018 | 0.139 // 0.041 | 100 // 100 |
| | kmeans | 0.0577 // 0.0058 | 0.0288 // 0.0028 | 0.0264 // 0.0026 | 0.0092 // 0.00039 | 0.054 // 0.017 | 0.139 // 0.04 | 100 // 100 |
| | rebmix | 0.6887 // 0.3391 | 0.2466 // 0.1276 | 0.4201 // 0.1969 | 0.019 // 0.048 | 0.519 // 0.289 | 1.721 // 0.875 | 87 // 81 |
| HDclassif | hc | 11.3787 // 11.3322 | 11.3499 // 11.3293 | 0.0264 // 0.0026 | 0.0094 // 0.00071 | 11.166 // 11.179 | 0.139 // 0.04 | 100 // 100 |
| | kmeans | 11.3831 // 11.3301 | 11.3543 // 11.3271 | 0.0264 // 0.0026 | 0.0094 // 0.00068 | 11.162 // 11.181 | 0.139 // 0.04 | 100 // 100 |
| | rebmix | 11.5949 // 11.6085 | 11.5167 // 11.6055 | 0.072 // 0.0026 | 0.0024 // 0.0022 | 11.227 // 11.369 | 0.27 // 0.044 | 87 // 81 |
| EMMIXmfa | hc | 5.9739 // 5.9149 | 4.0397 // 3.9727 | 1.8288 // 1.8228 | 0.32 // 0.47 | 6.999 // 7.042 | 4.25 // 4.296 | 100 // 100 |
| | kmeans | 5.972 // 5.8863 | 4.0431 // 3.9596 | 1.8283 // 1.8259 | 0.33 // 0.43 | 6.997 // 7.051 | 4.255 // 4.34 | 100 // 100 |
| | rebmix | 5.9835 // 5.9078 | 4.0477 // 3.9671 | 1.8257 // 1.8244 | 0.37 // 0.46 | 6.994 // 7.045 | 4.232 // 4.301 | 87 // 81 |

Table 21: MSE and Bias associated with scenarios HD8a) and HD8b) in Table 15 (overlapping and spherically-distributed components). Within each entry, the two values separated by double slashes report respectively the scores with n=200 and n=2000 observations.

| Package | Initialisation Method | Global MSE p | Global MSE μ | Global MSE σ | Global Bias p | Global Bias μ | Global Bias σ | Success |
|---|---|---|---|---|---|---|---|---|
| mixtools / Rmixmod / RGMMBench | hc | 18.6085 // 0.6735 | 3.566 // 0.0536 | 14.9495 // 0.6193 | 0.23 // 0.0017 | 3.327 // 0.107 | 14.475 // 0.649 | 100 // 100 |
| | kmeans | 16.7452 // 0.6735 | 2.9065 // 0.0536 | 13.7662 // 0.6193 | 0.2 // 0.0016 | 2.819 // 0.107 | 12.552 // 0.649 | 100 // 100 |
| | rebmix | 22.2986 // 0.6738 | 4.3127 // 0.0536 | 17.8418 // 0.6196 | 0.25 // 0.0021 | 3.768 // 0.108 | 16.249 // 0.648 | 95 // 100 |
| mclust / flexmix / GMKMcharlie | hc | 20.6916 // 0.728 | 4.5672 // 0.0696 | 16.0328 // 0.6557 | 0.28 // 0.064 | 4.381 // 0.459 | 18.656 // 2.07 | 100 // 100 |
| | kmeans | 17.9622 // 0.7169 | 3.7547 // 0.0671 | 14.1405 // 0.6474 | 0.27 // 0.062 | 3.88 // 0.465 | 16.802 // 2.049 | 100 // 100 |
| | rebmix | 22.4636 // 0.7553 | 4.7502 // 0.0678 | 17.5784 // 0.6853 | 0.26 // 0.0054 | 4.165 // 0.158 | 17.735 // 0.725 | 94 // 98 |
| bgmm | hc | 35.6085 // 13.8411 | 12.8826 // 3.6502 | 22.3428 // 10.0718 | 0.29 // 0.46 | 6.212 // 5.661 | 26.812 // 23.753 | 100 // 100 |
| | kmeans | 33.8007 // 12.5545 | 11.7236 // 3.1419 | 21.7292 // 9.2934 | 0.28 // 0.47 | 6.348 // 5.654 | 26.546 // 23.141 | 100 // 100 |
| | rebmix | 35.3167 // 13.106 | 12.2374 // 3.3747 | 22.6615 // 9.6213 | 0.37 // 0.42 | 6.007 // 5.273 | 26.287 // 23.02 | 96 // 100 |
| EMCluster | hc | 23.4472 // 16.4192 | 6.2451 // 4.9191 | 17.0777 // 11.3124 | 0.35 // 0.51 | 5.469 // 6.279 | 23.503 // 22.821 | 99 // 100 |
| | kmeans | 21.0058 // 14.6852 | 5.9951 // 4.1684 | 14.9329 // 10.4293 | 0.38 // 0.42 | 5.604 // 6.592 | 24.628 // 24.706 | 100 // 100 |
| | rebmix | 23.0408 // 19.7372 | 6.4923 // 7 | 16.3824 // 12.5099 | 0.36 // 0.35 | 5.272 // 5.454 | 23.419 // 23.9 | 93 // 98 |
| HDclassif | hc | 36.5924 // 33.4077 | 16.108 // 14.4007 | 20.3809 // 18.8638 | 0.3 // 0.44 | 12.706 // 13.085 | 25.393 // 29.363 | 100 // 100 |
| | kmeans | 34.7935 // 30.1935 | 15.4329 // 13.78 | 19.2529 // 16.2816 | 0.4 // 0.41 | 12.766 // 12.756 | 24.988 // 25.151 | 100 // 100 |
| | rebmix | 38.9707 // 24.0327 | 16.134 // 12.9266 | 22.6961 // 11.0138 | 0.25 // 0.21 | 12.811 // 12.275 | 25.79 // 15.996 | 95 // 98 |


Figure 22: Overview of scenarios HD1 a) and b) and HD8 a) and b) in Table 15, comparing the performance of the algorithms in the easiest and the most complex scenarios, respectively. Box plots of the estimated parameters are shown for simulations with n=200 observations (left-hand column) and n=2000 observations (right-hand column).


Figure 23: Correlation heatmaps of the estimated parameters in the high-dimensional (HD) setting, extended to the three benchmarked initialisation methods (hc, k-means and rebmix, respectively), in the most discriminating scenario HD8a), using the same process as described in Figure 2.

5 Supplementary References

CRAN packages used

mixsmsn, DCEM, turboEM, otrimle, tclust, oclust, fitmix, EMMIXgene, pcensmix, mixAK, mixture, AdaptGauss, MMDvariance, fabMix, pgmm, sensory, clustvarsel, mixdist, bmixture, bpgmm, DPP, dppmix, BayesCR, polySegratioMM, RPMM, SAGMM, mixreg, fmerPack, flexmix, nlsmsn, RobMixReg, RMixtComp, clustMD, deepgmm, IMIX, Rmixmod, MGMM, SMNCensReg, ClusterR, nor1mix, MixAll, pmclust, MixtureInf, mixR, CAMAN, microbenchmark

CRAN Task Views implied by cited packages

Bayesian, ChemPhys, Cluster, Distributions, Environmetrics, MachineLearning, MetaAnalysis, MissingData, NumericalMathematics, Omics, Psychometrics, Robust, Survival

Bioconductor packages used

epigenomix, semisup, Melissa, fmrs, coseq

M. Alberich-Carramiñana, B. Elizalde and F. Thomas. New algebraic conditions for the identification of the relative position of two coplanar ellipses. Computer Aided Geometric Design, 2017. DOI 10.1016/j.cagd.2017.03.013.
L. M. Avila, M. R. May and J. Ross-Ibarra. DPP: Inference of parameters of normal distributions from a mixture of normals. 2018.
S. Bacci, S. Pandolfi and F. Pennoni. A comparison of some criteria for states selection in the latent Markov model for longitudinal data. 2012. URL http://arxiv.org/abs/1212.0352.
P. Baker. polySegratioMM: Bayesian mixture models for marker dosage in autopolyploids. 2018.
J. D. Banfield and A. E. Raftery. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 1993. DOI 10.2307/2532201.
K. Basford, D. Greenway, G. Mclachlan, et al. Standard errors of fitted component means of normal mixtures. Computational Statistics, 1997. URL https://www.researchgate.net/publication/37625647_Standard_errors_of_fitted_component_means_of_normal_mixtures.
T. Benaglia, D. Chauveau, D. R. Hunter, et al. mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software, 2009.
L. Bergé, C. Bouveyron and S. Girard. HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data. Journal of Statistical Software, 2012. DOI 10.18637/jss.v046.i06.
C. Biernacki, G. Celeux and G. Govaert. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2000. DOI 10.1109/34.865189.
J. F. Bobb and R. Varadhan. turboEM: A suite of convergence acceleration schemes for EM, MM and other fixed-point algorithms. 2021.
C. Bouveyron, G. Celeux and S. Girard. Intrinsic Dimension Estimation by Maximum Likelihood in Isotropic Probabilistic PCA. Pattern Recognition Letters, 2011. DOI 10.1016/j.patrec.2011.07.017.
C. Bouveyron, S. Girard and C. Schmid. High-Dimensional Data Clustering. Computational Statistics & Data Analysis, 2007. DOI 10.1016/j.csda.2007.02.009.
R. P. Browne and P. D. McNicholas. Estimating common principal components in high dimensions. Advances in Data Analysis and Classification, 2014. DOI 10.1007/s11634-013-0139-1.
S. Cao, W. Chang and C. Zhang. RobMixReg: Robust mixture regression. 2020.
X. Cao and J. C. Spall. Relative performance of expected and observed fisher information in covariance estimation for maximum likelihood estimates. 2012. DOI 10.1109/ACC.2012.6315584.
R. B. Cattell. The Scree Test For The Number Of Factors. Multivariate Behavioral Research, 1966. DOI 10.1207/s15327906mbr0102_10.
G. Celeux, S. Chrétien and F. Forbes. A Component-wise EM Algorithm for Mixtures. 2012.
G. Celeux, S. Fruewirth-Schnatter and C. P. Robert. Model selection for mixture models - perspectives and strategies. 2018. DOI 10.48550/ARXIV.1812.09885.
G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 1992. DOI 10.1016/0167-9473(92)90042-E.
J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 2008. DOI 10.1093/biomet/asn034.
W.-C. Chen and G. Ostrouchov. Pmclust: Parallel model-based clustering using expectation-gathering-maximization algorithm for finite mixture gaussian model. 2021.
K. M. Clark and P. D. McNicholas. Oclust: Gaussian model-based clustering with outliers. 2019.
P. Coretto and C. Hennig. Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, 2016.
P. Coretto and C. Hennig. Otrimle: Robust model-based clustering. 2021.
M. Dai and A. Mukherjea. Identification of the Parameters of a Multivariate Normal Vector by the Distribution of the Maximum. Journal of Theoretical Probability, 2001. DOI 10.1023/A:1007889519309.
N. Dean, A. E. Raftery and L. Scrucca. Clustvarsel: Variable selection for gaussian model-based clustering. 2020.
M. Delattre and E. Kuhn. Estimating Fisher Information Matrix in Latent Variable Models based on the Score Function. 2019. DOI 10.48550/arXiv.1909.06094.
B. Efron and R. Tibshirani. An Introduction to the Bootstrap. 1993.
L. Fallah and J. Hinde. Pcensmix: Model fitting to progressively censored mixture data. 2017.
M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. DOI 10.1109/34.990138.
J. Fonseca. The application of mixture modeling and information criteria for discovering patterns of coronary heart disease. Journal of Applied Quantitative Methods, 2008. DOI 10.1109/EMS.2008.36.
C. Fraley, A. E. Raftery and L. Scrucca. Mclust: Gaussian mixture modelling for model-based clustering, classification, and density estimation. 2022.
B. C. Franczak, R. P. Browne and P. D. McNicholas. Sensory: Simultaneous model-based clustering and imputation via a progressive expectation-maximization algorithm. 2016.
M. T. Gallegos and G. Ritter. A robust method for cluster analysis. The Annals of Statistics, 2005. DOI 10.1214/009053604000000940.
A. M. Garay, M. B. Massuia and V. Lachos. SMNCensReg: Fitting univariate censored regression model under the family of scale mixture of normal distributions. 2022.
A. M. Garay, M. B. Massuia, V. H. Lachos, et al. BayesCR: Bayesian analysis of censored regression models under scale mixture of skew normal distributions. 2017.
L. A. García-Escudero, A. Gordaliza, C. Matrán, et al. A general trimming approach to robust cluster analysis. The Annals of Statistics, 2008. DOI 10.1214/07-AOS515.
Y. He and Z. Chen. The EBIC and a sequential procedure for feature selection in interactive linear models with high-dimensional data. Annals of the Institute of Statistical Mathematics, 2016. DOI 10.1007/s10463-014-0497-2.
E. A. Houseman, Sc.D., D. C. Koestler, et al. RPMM: Recursively partitioned mixture model. 2017.
S. Iovleff. MixAll: Clustering and classification using model-based mixture models. 2019.
A. M. Iscar, L. A. G. Escudero and H. Fritz. Tclust: Robust trimmed clustering. 2022.
T. Jaki, T.-L. Su, M. Kim, et al. An evaluation of the bootstrap for model validation in mixture models. Communications in statistics: Simulation and computation, 2018. DOI 10.1080/03610918.2017.1303726.
A. T. Jones. EMMIXgene: A mixture model-based approach to the clustering of microarray expression data. 2020.
A. T. Jones and H. D. Nguyen. SAGMM: Clustering via stochastic approximation and gaussian mixture models. 2019.
C. A. Kapourani. Melissa: Bayesian clustering and imputation of single-cell methylomes. 2022.
H.-U. Klein and M. Schaefer. Epigenomix: Epigenetic and gene transcription data normalization and integration with mixture models. 2022.
A. Komárek. mixAK: Multivariate normal mixture models and mixtures of generalized linear mixed models including model based clustering. 2022.
V. Kubicki, C. Biernacki and Q. Grimonprez. RMixtComp: Mixture models with heterogeneous and (partially) missing data. 2021.
H. Kurban, M. Jenne and M. M. Dalkilic. Using data to build a better EM: EM* for big data. International Journal of Data Science and Analytics, 2017. DOI 10.1007/s41060-017-0062-1.
F. Langrognet, R. Lebret, C. Poli, et al. Rmixmod: Classification with mixture modelling. 2021.
K. M. Leytham. Maximum Likelihood Estimates for the Parameters of Mixture Distributions. Water Resources Research, 1984. DOI 10.1029/WR020i007p00896.
K.-C. Li. Sliced Inverse Regression for Dimension Reduction. Journal of the American Statistical Association, 1991. DOI 10.2307/2290563.
S. Li, J. Chen and P. Li. MixtureInf: Inference for finite mixture models. 2016.
X. Li, Y. Fu, X. Wang, et al. MMDvariance: Detecting differentially variable genes using the mixture of marginal distributions. 2018.
Y. Li and K. Chen. fmerPack: Tools of heterogeneity pursuit via finite mixture effects model. 2021.
B. G. Lindsay. Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 1995. URL https://www.jstor.org/stable/4153184.
T. A. Louis. Finding the Observed Information Matrix when Using the EM Algorithm. Journal of the Royal Statistical Society, 1982. DOI 10.1111/j.2517-6161.1982.tb01203.x.
X. Lu, Y. Li and T. Love. Bpgmm: Bayesian model selection approach for parsimonious gaussian mixture models. 2022.
P. Macdonald, with contributions from J. Du. Mixdist: Finite mixture distribution models. 2018.
R. Maitra and V. Melnykov. Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms. Journal of Computational and Graphical Statistics, 2010. DOI 10.1198/jcgs.2009.08054.
A. Marandon, T. Rebafka, E. Roquain, et al. False clustering rate control in mixture models. 2022. DOI 10.48550/ARXIV.2203.02597.
Z. McCaw. MGMM: Missingness aware gaussian mixture models. 2021.
G. J. McLachlan, D. Peel and R. W. Bean. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis, 2003. DOI 10.1016/S0167-9473(02)00183-4.
G. McLachlan and D. Peel. Finite Mixture Models: McLachlan/Finite Mixture Models. John Wiley & Sons, Inc., 2000. DOI 10.1002/0471721182.
G. McLachlan and D. Peel. Mixtures of Factor Analyzers. 2000. DOI 10.48550/arXiv.1507.02801.
P. D. McNicholas, A. ElSherbiny, A. F. McDaid, et al. Pgmm: Parsimonious gaussian mixture models. 2022.
P. D. McNicholas and T. B. Murphy. Model-based clustering of microarray expression data via latent gaussian mixture models. Bioinformatics, 2010. DOI 10.1093/bioinformatics/btq498.
P. D. McNicholas and T. B. Murphy. Parsimonious gaussian mixture models. Statistics and Computing, 2008. DOI 10.1007/s11222-008-9056-0.
P. D. McNicholas, T. B. Murphy, A. F. McDaid, et al. Serial and parallel implementations of model-based clustering via parsimonious gaussian mixture models. Computational Statistics & Data Analysis, 2010. DOI 10.1016/j.csda.2009.02.011.
D. McParland and I. C. Gormley. clustMD: Model based clustering for mixed data. 2017.
L. Meng. Method for computation of the fisher information matrix in the expectation-maximization algorithm. 2016. DOI 10.48550/ARXIV.1608.01734.
X.-L. Meng and D. Rubin. Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika, 1993. DOI 10.1093/biomet/80.2.267.
X.-L. Meng and D. Van Dyk. The EM Algorithm–an Old Folk-song Sung to a Fast New Tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1997. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00082.
O. Mersmann. Microbenchmark: Accurate timing functions. 2021.
R. Mohammadi. Bmixture: Bayesian estimation for finite mixture of distributions. 2021.
L. Mouselimis. ClusterR: Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering. 2022.
K. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation; Machine Learning series, 2012. DOI 10.1080/09332480.2014.914768.
E. Nowakowska, J. Koronacki and S. Lipovetsky. Tractable Measure of Component Overlap for Gaussian Mixture Models. 2014. DOI 10.48550/arXiv.1407.7172.
D. Oakes. Direct calculation of the information matrix via the EM. Journal of the Royal Statistical Society, 1999. DOI 10.1111/1467-9868.00188.
P. Papastamoulis. fabMix: Overfitting bayesian mixtures of factor analyzers with parsimonious covariance and unknown number of components. 2020.
M. Pastore and A. Calcagnì. Measuring Distribution Similarities Between Samples: A Distribution-Free Overlapping Index. Frontiers in Psychology, 2019. DOI 10.3389/fpsyg.2019.01089.
K. B. Petersen and M. S. Pedersen. The matrix cookbook. 2008.
N. Pocuca, R. P. Browne and P. D. McNicholas. Mixture: Mixture models for clustering and classification. 2022.
M. Prates, V. Lachos and C. Cabral. Mixsmsn: Fitting finite mixture of scale mixture of skew-normal distributions. 2021a.
M. Prates, V. Lachos and A. Garay. Nlsmsn: Fitting nonlinear models with scale mixture of skew-normal distributions. 2021b.
A. Rau. Coseq: Co-expression analysis of sequencing data. 2022.
A. Rauschenberger. Semisup: Semi-supervised mixture model. 2022.
R. A. Redner and H. F. Walker. Mixture Densities, Maximum Likelihood and the Em Algorithm. SIAM Review, 1984. DOI 10.1137/1026034.
C. Robert and G. Casella. Introducing Monte Carlo Methods with R. Springer, 2010. DOI 10.1007/978-1-4419-1576-4.
P. Schlattmann, J. Hoehne and M. Verba. CAMAN: Finite mixture models and meta-analysis tools - based on c.a.MAN. 2022.
G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 1978. DOI 10.1214/aos/1176344136.
L. Scrucca. Dimension reduction for model-based clustering. Statistics and Computing, 2010. DOI 10.1007/s11222-009-9138-7.
L. Scrucca, M. Fop, T. B. Murphy, et al. Mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R journal, 2016. DOI 10.32614/RJ-2016-021.
F. Shokoohi. Fmrs: Variable selection in finite mixture of AFT regression and FMR models. 2022.
M. Thrun, O. Hansen-Goos and A. Ultsch. AdaptGauss: Gaussian mixture models (GMM). 2020.
M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999. DOI 10.1111/1467-9868.00196.
R. Turner. Mixreg: Functions to fit mixtures of regressions. 2021.
C. Viroli and G. J. McLachlan. Deepgmm: Deep gaussian mixture models. 2020.
G. R. Walsh. Methods of Optimization. Wiley, 1975.
Z. Wang. IMIX: Gaussian mixture model for multi-omics data integration. 2022.
Y. Xu, P. Mueller, D. Telesca, et al. Dppmix: Determinantal point process mixture models. 2020.
Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 2005. DOI 10.1093/biomet/92.4.937.
Y. Yu. mixR: Finite mixture modeling for raw and binned data. 2021.

  1. It is equivalent to computing the MLE of a sample following distribution \(f_{\zeta_j}\), weighted by the vector of posterior probabilities.↩︎

  2. when \(\boldsymbol{A}\) is itself symmetric, since then, by definition, \(\boldsymbol{A}^\top=\boldsymbol{A}\).↩︎

  3. Other matrix calculus formulas and notations are available on Matrix calculus and demonstration details from The Matrix Cookbook (Petersen and Pedersen 2008).↩︎

  4. We assign “ML” to the method argument to get the biased but true MLE estimate of the covariance.↩︎

  5. Langrognet et al. (2021) enforces an additional but, in our opinion, superfluous constraint that the eigenvalues are sorted in decreasing order.↩︎

  6. Starting from the eigendecomposition described in Equation (12), this approach is equivalent to keeping only the \(d_j\) largest eigenvalues resulting from the decomposition and setting the others to zero.↩︎

  7. Although principal component analysis and factor analysis are closely related, the two approaches can be differentiated by their objectives: while PCA seeks to capture the overall variability of the dataset, factor analysis focuses on describing the intra-variability between covariates. In practice, the differences between the two approaches are minor; notably, one can show that the output of PCA is one of the solutions suggested by standard factor analysis.↩︎

  8. The use of weighted distributions has more general applications. It can be used to deal with a component distribution that does not fit exactly a Gaussian shape. For instance, to deal with heavy tail distributions, more weight can be given to central components and less weight to the tails.↩︎

  9. Indeed, by simply uniformly sampling among the \(2^D\) models available, the probability of getting models with \(D/2\) features is much higher than that of drawing models at the boundaries, containing either very few or close to \(D\) covariates.↩︎

  10. Additionally, bgmm does not update the estimated variances if any newly computed variance falls below the stopping threshold. A remarkable side-effect of this behaviour, as shown in Figure 2, is that the bgmm package is less sensitive to the presence of outliers.↩︎

  11. Indeed, at least one of the components was removed in \(80\%\) of our estimations in the unbalanced and overlapping case (scenario U9 in Table 5) and in \(20\%\) of the simulations in the unbalanced and well-separated case (scenario U3 in Table 5).↩︎

  12. These options are set respectively to 0 and \(+\infty\) by default, so they did not impact our simulations.↩︎

  13. In our simulations, the behaviour of the GMKMcharlie package did not differ significantly from the remaining packages of the second class. However, the use of a Euclidean distance criterion may be problematic when parameters are not of the same order of magnitude, requiring their prior normalisation.↩︎

  14. By default, set to two, i.e., the minimum number of replications needed to derive an unbiased estimate of the empirical variance of a sample.↩︎

  15. To test whether differences between mean running times or estimation performances differ across packages, we used the between-subjects ANOVA test rstatix::anova_test() to generate the p-values and rstatix::partial_eta_squared() to compute the corresponding effect sizes.↩︎

  16. More on the Hellinger distance as a quantifier of the similarity between two probability distributions in Hellinger distance↩︎
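
For reference, the identity that footnote 2 qualifies is, presumably, the standard quadratic-form derivative from matrix calculus (a general fact, not specific to this benchmark), which simplifies precisely when \(\boldsymbol{A}\) is symmetric:

\[
\frac{\partial \left(\boldsymbol{x}^\top \boldsymbol{A}\, \boldsymbol{x}\right)}{\partial \boldsymbol{x}} = \left(\boldsymbol{A} + \boldsymbol{A}^\top\right)\boldsymbol{x} = 2\,\boldsymbol{A}\,\boldsymbol{x}
\]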
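
The truncation described in footnote 6 can be sketched as follows, assuming Equation (12) writes the eigendecomposition of the \(j\)-th covariance matrix as \(\Sigma_j = Q_j \Lambda_j Q_j^\top\), with eigenvalues sorted in decreasing order:

\[
\Lambda_j = \operatorname{diag}\left(\lambda_{j1}, \dots, \lambda_{jD}\right), \quad \lambda_{j1} \geq \dots \geq \lambda_{jD} \geq 0, \qquad
\tilde{\Sigma}_j = Q_j \operatorname{diag}\left(\lambda_{j1}, \dots, \lambda_{jd_j}, 0, \dots, 0\right) Q_j^\top
\]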
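
As a complement to footnote 16, the squared Hellinger distance between two multivariate Gaussian distributions admits the standard closed form below (a general identity; the benchmarked distances may rely on it or on a numerical equivalent):

\[
H^2\left(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\right) = 1 - \frac{\det(\Sigma_1)^{1/4} \det(\Sigma_2)^{1/4}}{\det\left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{1/2}} \exp\left\{ -\frac{1}{8}\, (\mu_1 - \mu_2)^\top \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (\mu_1 - \mu_2) \right\}
\]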

References

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Chassagnol, et al., "Supplementary note", The R Journal, 2023

BibTeX citation

@article{chassagnol-becht-nuel-benchmark-of-Gaussian-mixtures-appendix,
  author = {Chassagnol, Bastien and Bichat, Antoine and Nuel, Gregory and Becht, Etienne},
  title = {Supplementary note},
  journal = {The R Journal},
  year = {2023},
  issn = {2073-4859},
  pages = {1}
}